Mattia,

Your enthusiasm is refreshing. More than 10 years ago I presented a paper about dealing with "large" data sets using J - you can read it at

  http://www.jsoftware.com/papers/tuttle.htm

and there are several other interesting papers, including some about dealing with large collections of data at

  http://www.jsoftware.com/jwiki/Articles

Of course, what were large data sets in 1996 can easily be done in memory these days... Since my talk in 1996 I have continued to use J to look at largish collections of data - but none of the magnitude you speak of. Memory mapped files are a very powerful tool, and a 64 bit system should allow you to work directly with your large data sets - depending on how the files are organized, mapping can provide "J like" structures that are pleasant to work with. If your files are encoded in a relational database's own format, they may be difficult to map directly - but sometimes even such encoded files can be handled.
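
To give you an idea of what mapping looks like in practice, here is a small sketch using the jmf script that ships with J. The file name and size below are invented, and I am quoting the verb spellings and argument forms from memory - check the script and the wiki for your version before relying on them:

   require 'jmf'                      NB. standard memory mapped file utilities
   createjmf_jmf_ 'prices.jmf';1e8    NB. create a ~100 MB mapped file (invented name and size)
   map_jmf_ 'prices';'prices.jmf'     NB. the noun prices now refers to the file contents
   $ prices                           NB. use it like any other J noun - no explicit read step
   unmap_jmf_ 'prices'                NB. release the mapping when you are done

The point is that once a file is mapped, the operating system pages the data in and out as you index into the noun, so you are not limited to what fits in RAM at one time.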

You are likely to get more opinions and answers if you give some example data and the kinds of analysis you want to perform. I assume you have discovered and experimented with things like aggregation using the key adverb /. in J - a toy sketch follows below. You would probably have to use some chunking of the data to process the 7-13 GB collections, but given generous memory and fast I/O you should get good performance. Guessing what the times might actually be is quite beyond me, but again, if you give some example data and what you are trying to extract, some forum members may have actual experience to bring to your performance questions.
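
For flavor, here is a toy version of a SELECT ... GROUP BY using /. - the names and numbers are made up, and a real run would of course read your own data, perhaps a slice at a time with an indexed fread ('fread file;start,len' from the files script, if I remember the argument form correctly):

   id  =: ;: 'acme acme globex acme globex'   NB. invented grouping key, one box per row
   amt =: 10 20 30 40 50                      NB. invented values to aggregate
   id +//. amt                                NB. sum of amt within each distinct id
70 80
   (~. id) ,. <"0 id +//. amt                 NB. a 2-column boxed table: each id with its total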

One thing that is almost always true: finding good algorithms is important if you want to avoid brute force (long/slow) solutions...

- joey


At 23:24 -0400 2008/03/30, Mattia Landoni wrote:
Hi all,

This is a narrowed-down version of an email I just sent to the general list
with the same subject.

I am an economist and I discovered J a few days ago. I haven't been so
excited since I was 13 and Santa brought me an 8-bit Nintendo Entertainment
System. Yet before taking a week off from work to study J (just kidding),
I would like to be sure it does everything I need. Here is what concerns me
the most.

- How does J deal with very large datasets? Currently I am dealing with a
65 GB dataset. So far the only software I can use is SAS. Performing an SQL
query [SELECT, GROUP BY] in SAS on a dedicated server takes me six hours, of
which a large part is network I/O (I guess SAS's computing time would be an
hour, perhaps two). The data is divided into 7 chunks of 7 to 13 GB each.
With the same amount of data on a good computer, would I be able to perform
the same operations with J? Assuming plentiful RAM and a speedy processor,
what's the order of magnitude of the time it would take?
- I read something about memory mapping in past posts and I intuitively
understand what it means, but I have never done it. What are the limits of
memory mapping? In general, what are the techniques for dealing with large
datasets?

Any answer, hint, link,... most welcome.

Mattia

--
Mattia Landoni
1201 S Eads St Apt 417
Arlington, VA 22202-2837
USA
Greenwich -5 hours

Office: +1 202 62 35922
Cell: +1 202 492 3404
Home: +1 360 968 1684

Govern a great country as you would fry a small fish: do not poke at it too
much.
-- Lao Tzu

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
