I'm using J to work with genomic data (often >700,000 SNPs) from
thousands of animals. Depending on the file format, file sizes can
easily run to many GB.

My colleagues use a mix of R, Java, Scala and C, and they've
generally been impressed with J's performance, development time and
conciseness - most are still put off by the look of the code, though.

IMO, J's biggest strength relative to R is the consistency and
elegance with which it manipulates and summarizes data structures.
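
For example, with genotypes coded 0 1 2 in an animals-by-SNPs matrix,
per-SNP summaries are one-liners. A toy sketch (the matrix and the
coding here are made up purely for illustration):

   mean =: +/ % #              NB. the classic J mean fork
   geno =: ? 5 10 $ 3          NB. toy 5 animals x 10 SNPs, codes 0 1 2
   freq =: mean geno % 2       NB. per-SNP allele frequency
   (<./ , >./ , mean) freq    NB. min, max and mean frequency in one train

The same verbs work unchanged on vectors, matrices and higher-rank
arrays, which is a large part of that consistency.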

I'd second Raul's advice about breaking the data up into blocks;
otherwise you quickly run out of resources trying to work with a 30k
by 700k matrix all at once. I've had a brief go at comparing routines
that process a file block by block against memory-mapped files, and
didn't find memory-mapping very compelling, especially since it adds
complexity to the system - perhaps I need to experiment more, though.
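
For what it's worth, blocked processing can be as simple as a loop
over indexed reads. A minimal sketch, assuming a flat binary file of
one-byte genotype codes (the file name and the per-block summary are
hypothetical):

   require 'files'     NB. fread/fsize from the standard library

   NB. sum of all byte codes in a file, read 1e6 bytes at a time
   sumfile =: 3 : 0
    blk =. 1000000
    sz  =. fsize y
    tot =. 0
    pos =. 0
    while. pos < sz do.
     chunk =. fread y ; pos , blk <. sz - pos  NB. indexed read: start,length
     tot =. tot + +/ a. i. chunk               NB. bytes to 0..255 integers
     pos =. pos + blk
    end.
    tot
   )

   sumfile 'geno.dat'

(For the memory-mapped side, the usual route is the jmf library
(require 'jmf') and its map_jmf_ verb, which exposes a file directly
as a J noun.)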

On Thu, Sep 8, 2011 at 3:45 AM, Raul Miller <[email protected]> wrote:
> On Wed, Sep 7, 2011 at 11:33 AM, Zheng, Xin (NIH) [C]
> <[email protected]> wrote:
>> I come from R and have got tired of its performance, so I'm looking
>> for another language. I wonder how J performs when analyzing big
>> data (GB or even TB). Could anyone give a rough idea?
>
> The answer depends on your machine and your computations and your
> operating system.
>
> For a very rough first approximation, assume a factor of 5 overhead on
> the data structures being manipulated (since you typically need
> intermediate results).  And assume that if your calculation requires
> swap you will pay a factor of 1000 in extra time (though how slow it
> is depends on how swap is implemented on your machine).
>
> For large calculations I usually like breaking things up into blocks
> that easily fit into memory (blocks of 1e6 data elements often work
> fine).
>
> You will probably want to use memory mapped files for large data structures.
>
> I have never tried TB files in J.  You may want to consider kx.com's
> interpreter instead of J if you routinely work with data of that size
> -- their user community regularly handles very large data sets.
> Expect to pay a lot of money, though, if you go that route.
>
> --
> Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
