I think memory-mapped files effectively become extensions of the swap file.
As such, two things happen. One is that paging handles reading from and
writing to the file. The other is that the entire file is mapped into
virtual storage, though nothing is read from the file until it is needed.
When reading the file conventionally, all of it is pulled through real
memory and then written out to the swap file if there is not enough real
memory; but once it has been read in, it is no different from a
memory-mapped file. Where memory-mapped files can really be beneficial is
when one only needs to reference small parts of a file, never even needing
to access most of it. With a memory-mapped file one can use J's ordinary
indexing etc. instead of having to use indexed reads. In either case, if
access is to the whole file, or random access all over it, you had better
have enough real memory to keep swapping down. If not, then it is a good
idea to follow Ric's suggestion and handle it in pieces.
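
For example, something along these lines - just a sketch, where the file
name and byte offsets are made up and the jmf calls are written from memory
of that addon, so check its documentation for the exact signatures:

   NB. Indexed read: only the requested bytes come off the disk.
   chunk =: 1!:11 'bigdata.bin' ; 4000 80   NB. 80 bytes from offset 4000

   NB. Memory-mapped: the whole file appears as a J noun, but pages are
   NB. only faulted in as they are referenced.
   require 'jmf'
   JCHAR map_jmf_ 'dat' ; 'bigdata.bin'     NB. assumes the raw-file mapping form
   head =: 100 {. dat                       NB. ordinary J indexing on the map
   unmap_jmf_ 'dat'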

With WIN32 the virtual storage available to a process is only 2 GB, which
can be a limiting factor for very large files. If the file would take up a
significant portion of that 2 GB, one would need to forget about using
mapped files. 64-bit systems don't have that limit.
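
If it comes to handling it in pieces, a minimal sketch is below; the 16 MB
block size and the line-counting "work" are only there for illustration,
and the file name is hypothetical:

   countlines =: 3 : 0
     file =. y
     fsize =. {. , 1!:4 < file       NB. total size in bytes
     blk =. 16777216                 NB. read 16 MB at a time
     n =. 0
     offset =. 0
     while. offset < fsize do.
       len =. blk <. fsize - offset
       chunk =. 1!:11 file ; offset , len
       n =. n + +/ chunk = LF        NB. per-block work: count line feeds
       offset =. offset + len
     end.
     n
   )
   countlines 'bigdata.txt'          NB. hypothetical file

Only one block is ever resident, so real memory use stays bounded no matter
how big the file is.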

On Wed, Sep 7, 2011 at 3:56 PM, Ric Sherlock <[email protected]> wrote:

> I'm using J to work with Genomic data (often >700,000 SNPs) from
> thousands of animals. Depending on file format, file sizes can easily
> increase to many GBs.
>
> My colleagues use a variety of R, Java, Scala and C and they've
> generally been impressed with J's performance and development time and
> conciseness - most are still put off by the look of the code though.
>
> IMO, J's biggest strength relative to R is in its consistency and
> elegance of manipulating and summarizing data structures.
>
> I'd second Raul's advice about breaking the data up into blocks,
> otherwise you quickly run out of resources trying to work with a 30k
> by 700k matrix all at once. I have had a brief go at comparing the
> performance of routines that process blocks of a file at a time with
> memory-mapped files and didn't find memory-mapping to be very
> compelling, especially because they increase the complexity of the
> system - perhaps I need to experiment more though.
>
> On Thu, Sep 8, 2011 at 3:45 AM, Raul Miller <[email protected]> wrote:
> > On Wed, Sep 7, 2011 at 11:33 AM, Zheng, Xin (NIH) [C]
> > <[email protected]> wrote:
> >> I am from R and I've got tired of its performance, so I am looking
> >> for some other language. I wonder how J performs when analyzing big
> >> data (GB or even TB). Could anyone give a rough idea?
> >
> > The answer depends on your machine and your computations and your
> > operating system.
> >
> > For a very rough first approximation assume a factor of 5 overhead on
> > the data structure being manipulated (since you typically need
> > intermediate results).  And assume that if your calculation requires
> > swap you will need a factor of 1000 extra time (though how slow
> > depends on how swap is implemented on your machine).
> >
> > For large calculations I usually like breaking things up into blocks
> > that easily fit into memory (blocks of 1e6 data elements often works
> > fine).
> >
> > You will probably want to use memory mapped files for large data
> > structures.
> >
> > I have never tried TB files in J.  You may want to consider kx.com's
> > interpreter instead of J if you routinely work on that size of data --
> > their user community routinely works on very large data sets.  Expect
> > to pay a lot of money, though, if you go that route.
> >
> > --
> > Raul