For large text files, I've found Perl to be quite efficient. One performance advantage the Unix commands have over J is that they are stream-oriented: they work through a file a piece at a time, whereas an array-oriented language like J tends to want to have the whole array available at once.
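To illustrate what stream-oriented means here, below is a minimal sketch of a line-based file splitter. It is in Python rather than the Perl or J discussed in this thread, and it is not the code of either splitter mentioned in this message. The function name `split_by_lines`, its parameters, and the `part_NNNNN.txt` naming scheme are all made up for the example. The point is only that one line is held in memory at a time, so memory use should not grow with the size of the input.

```python
import os


def split_by_lines(src_path, out_dir, lines_per_part):
    """Split src_path into numbered part files of at most lines_per_part lines.

    Hypothetical helper for illustration. The input is read one line at a
    time, so the whole file is never held in memory.
    """
    if lines_per_part < 1:
        raise ValueError("lines_per_part must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    try:
        # Binary mode avoids any decoding cost and preserves line endings.
        with open(src_path, "rb") as src:
            for line_no, line in enumerate(src):
                # Start a new part file every lines_per_part lines.
                if line_no % lines_per_part == 0:
                    if out is not None:
                        out.close()
                    name = "part_%05d.txt" % len(part_paths)
                    part_path = os.path.join(out_dir, name)
                    part_paths.append(part_path)
                    out = open(part_path, "wb")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return part_paths
```

Calling `split_by_lines("big.txt", "parts", 1000000)` would write `parts/part_00000.txt`, `parts/part_00001.txt`, and so on, and return the list of paths written. An array-oriented version would instead read the whole file first and then cut it up, which is where the memory pressure comes from.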
I actually have a large file splitter in both Perl and J. The J version uses the bigfiles utilities and runs maybe 25% faster than the Perl one, but the Perl was easier to write and is more platform-independent than this piece of J code, because of the bigfiles wrinkle.

On Thu, Aug 27, 2009 at 6:17 AM, Matthew Brand <[email protected]> wrote:

> I am using 64 bit linux so do not run into any file size issues. It
> appears that the whole file is read into memory (i.e. swap disk)
> before any operations are carried out. It might be more efficient to
> use mapped files.
>
> Splitting into many smaller files takes less time because at no point
> does the program have to use the swap disk. I agree that on a machine
> with much larger RAM it would probably not make a difference.
>
> I don't know the details, but I wonder how the unix gawk command
> manages to trundle through huge data files a line at a time seemingly
> efficiently, could J do it in a similar way (whatever that is!)?
>
> 2009/8/27 R.E. Boss <[email protected]>:
> > Link should be
> > http://www.jsoftware.com/jwiki/Scripts/Working%20with%20Big%20Files
> >
> > R.E. Boss
> >
> >> -----Original message-----
> >> From: [email protected] [mailto:programming-
> >> [email protected]] On behalf of Devon McCormick
> >> Sent: Thursday, 27 August 2009 3:56
> >> To: Programming forum
> >> Subject: Re: [Jprogramming] streaming through a large text file
> >>
> >> These could be made to work on files >4GB using the bigfiles code (Windows
> >> only) but they would have to be re-written to do that. You'd have to use
> >> "bixread" instead of 1!:11 and deal with extended integers - see
> >> http://www.jsoftware.com/jwiki/Scripts/Working with Big Files for more on
> >> this if you're interested.
> >>
> >> On Wed, Aug 26, 2009 at 6:42 PM, Sherlock, Ric
> >> <[email protected]> wrote:
> >>
> >> > Is the reason that fapplylines & freadblock don't work on files >4GB
> >> > because a 32bit system can't represent the index into the file as a
> >> > 32bit integer?
> >> > In other words they may well work OK on a 64bit system?
> >> >
> >> > I think bigfiles.ijs is Windows only? If so it would be an alternative
> >> > if using a 32bit Windows system, but it sounds like Matthew is on Linux.
> >> >
> >> > > From: Don Guinn
> >> > >
> >> > > Use bigfiles.ijs
> >> > >
> >> > > On Wed, Aug 26, 2009 at 4:09 PM, Devon McCormick wrote:
> >> > >
> >> > > > I thought I'd try this code but it doesn't work with very large
> >> > > > files (>4 GB).
> >> > > >
> >> > > > On Wed, Aug 26, 2009 at 11:46 AM, R.E. Boss wrote:
> >> > > >
> >> > > > > > Chris Burke wrote:
> >> > > > > > > Matthew Brand wrote:
> >> > > > > > > Thanks for the links. I tried the fapplylines adverb but the
> >> > > > > > > computer grinds along for 30 minutes or so before I pulled the
> >> > > > > > > plug. It ends up using 10Gb of (mainly virtual) memory. There
> >> > > > > > > are 40M lines in my file.
> >> > > > > > >
> >> > > > > > > I will use the unix split command to make lots of little files
> >> > > > > > > and (myverb fapplylines)&.> fname to solve the problem.
> >> > > > > >
> >> > > > > > There should be little difference between processing lots of
> >> > > > > > small files, and one big file in chunks.
> >> > > > > >
> >> > > > > > What processing is being done? What result is being accumulated?
> >> > > > > >
> >> > > > > > Why not test on a small file first and find out what is taking
> >> > > > > > time - and only then try on the full file?
> >> > > > > >
> >> > > > >
> >> > > > > My guess is we can improve the efficiency of your code by at least
> >> > > > > a factor 2 (= Hui's constant).
> >> > > > >
> >> > ----------------------------------------------------------------------
> >> > For information about J forums see http://www.jsoftware.com/forums.htm

--
Devon McCormick, CFA
^me^ at acm.
org is my
preferred e-mail
