Mattia,
Your enthusiasm is refreshing - more than 10 years ago, I presented a
paper about dealing with "large" data sets using J - you can read it
at
http://www.jsoftware.com/papers/tuttle.htm
and there are several other interesting papers, including some about
dealing with large collections of data at
http://www.jsoftware.com/jwiki/Articles
Of course, what were large data sets in 1996 can easily be handled in
memory these days... After my talk in 1996, I have continued to use J
to look at largish collections of data - but none of the magnitude you
speak of. Memory mapped files are a very powerful tool, and a 64-bit
system should allow you to work directly with your large data sets -
depending on how the files are organized, mapping can provide "J-like"
structures that are pleasant to work with. If your files are encoded
in a relational database's own format, they may be awkward to process
directly - but sometimes even such encoded files can be handled that
way.
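To give a flavor of what mapping looks like, here is a minimal sketch
using the jmf script that ships with J (if I remember its interface
correctly); the file name and noun names are only placeholders, and it
assumes a flat text file you want to treat as one big character array.

   require 'jmf'                    NB. memory mapped file script
   NB. map a large flat text file as a character array - mapping reads
   NB. nothing up front; the OS pages data in as you index it
   JCHAR map_jmf_ 'txt';'/data/export.csv'
   # txt                            NB. length in bytes
   +/ txt = LF                      NB. e.g. count the lines
   unmap_jmf_ 'txt'                 NB. release the mapping when done

On a 64-bit system the mapped file can be far larger than physical
memory; only the parts you actually touch are paged in.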
You are likely to get more opinions and answers if you give some
example data and the kinds of analysis you want to perform. I assume
you have discovered and experimented with things like the key adverb
/. for aggregation in J (a small sketch follows below). You would
probably have to use some chunking of data to process the 7-13 GB
collections, but given generous memory and fast IO you should get
good performance - guessing what the times might actually be is quite
beyond me, but again, if you give some example data and what you are
trying to extract, some forum members may have actual experience to
bring to bear on your performance questions.
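For instance, the SELECT ... GROUP BY you describe corresponds quite
directly to the key adverb. A toy example (the data is made up, of
course):

   keys =: 1 2 1 3 2                NB. grouping column
   vals =: 10 20 30 40 50           NB. column to aggregate
   (~. keys) ,. keys +//. vals      NB. distinct keys beside their sums
1 40
2 70
3 40

And for chunking, the indexed read form of fread (built on the 1!:11
foreign) lets you walk a file in pieces and combine partial results.
The chunk size, file name, and the trivial per-chunk "analysis" below
are only placeholders for whatever you actually need to compute:

   countLF =: 3 : 0
     NB. y: file name; process the file in pieces, accumulating a result
     chunk =. 100e6                 NB. 100 MB at a time
     size  =. fsize y
     total =. 0
     start =. 0
     while. start < size do.
       piece =. fread y ; start , chunk <. size - start
       total =. total + +/ piece = LF   NB. stand-in for real work
       start =. start + chunk
     end.
     total
   )
   countLF '/data/big.csv'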
One thing that is almost always true is that finding good algorithms
is important to avoid brute force (long/slow) solutions...
- joey
At 23:24 -0400 2008/03/30, Mattia Landoni wrote:
Hi all,
this is a narrowed-down version of an email I just sent to the general
list with the same subject.
I am an economist and I discovered J a few days ago. I haven't been so
excited since I was 13, when Santa brought me an 8-bit Nintendo
Entertainment System. Yet before taking a week off from work to study J
(just kidding), I would like to be sure it does everything I need. Here
is what concerns me the most.
- How does J deal with very large datasets? Currently I am dealing with
a 65-GB dataset. So far the only software I can use is SAS. Performing
an SQL query [SELECT, GROUP BY] in SAS on a dedicated server takes me
six hours, a large part of which is network I/O (I guess SAS's
computing time would be an hour, perhaps two). The data is divided into
7 chunks of 7 to 13 GB each. With the same amount of data on a good
computer, would I be able to perform the same operations with J? Assume
plentiful RAM and a speedy processor: what's the order of magnitude of
the time it would take?
- I read something about memory mapping in past posts and I intuitively
understand what it means, but I have never done it. What are the limits
of memory mapping? In general, what are the techniques for dealing with
large datasets?
Any answer, hint, link,... most welcome.
Mattia
--
Mattia Landoni
1201 S Eads St Apt 417
Arlington, VA 22202-2837
USA
Greenwich -5 hours
Office: +1 202 62 35922
Cell: +1 202 492 3404
Home: +1 360 968 1684
Govern a great country as you would fry a small fish: do not poke at it too
much.
-- Lao Tzu
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm