When I wrote earlier about using "extended integers" to address beyond 2^32, it occurred to me that this is exactly what "bigfiles" does _not_ do: it uses a pair of (signed) integers because that is what the Windows API requires. This inspired me to modify my old code to use J extended integers externally, which is much neater than faking large numbers with a pair of integers.
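
To make the contrast concrete, here is a minimal sketch of converting between the two representations. It is not the actual bigfiles code: the names pair2ext and ext2pair are hypothetical, and K32 is assumed to be 2^32 held as an extended integer.

```j
NB. Sketch only: pair2ext and ext2pair are hypothetical names, not bigfiles verbs.
NB. K32 is assumed here to be 2^32 as an extended integer.
K32=: 2x ^ 32

NB. Combine a (high, low) pair of signed 32-bit words into one extended offset.
NB. The residue K32 | maps a negative word onto its unsigned value first.
pair2ext=: 3 : 'K32 #. K32 | x: y'

NB. Split an extended offset back into (high, low) unsigned 32-bit words.
ext2pair=: 3 : '(2 # K32) #: y'
```

In a session this would give:

```j
   pair2ext 1 705032704
5000000000
   pair2ext 0 _1
4294967295
   ext2pair 5000000000x
1 705032704
```

Note that ext2pair returns unsigned words; anything handed to an API that expects signed 32-bit integers would still need a further conversion.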

When I revamped my old code, I noticed that my special "self-numbered" test file seems to have a coincidentally high proportion of integral powers of ten, so I address this in the new version of the essay at
http://www.jsoftware.com/jwiki/Scripts/Working%20with%20Big%20Files

On Thu, Aug 27, 2009 at 9:16 PM, David Mitchell <[email protected]> wrote:

> I created a 5 gigabyte file with 40e6 lines of length 100-150:
>
> NB. Test data creation
>
> require 'files'
> require 'bigfiles'
>
> test5g=: 3 : 0
>   qq=.100+?(40e6#50)
>   '' fappend 'e:\5gtst'
>   for_i. i. 40e6%1000 do.
>     (;(CRLF,~])&.>(1000{.(i*1000)}.qq){.&.><'This is a test of a long and similar string.') bappend_jbf_ 'e:\5gtst'
>   end.
> )
>
> I ran a test using OpenSUSE64:
>
> addc=: 3 : 0
>   c=:c+#y
> )
>
> c=:0
>
> 1 addc fapplylines '/windows/e/5gtst'
>
> This completed in about 7-11 minutes, with c containing the correct total of
> bytes in the file.
>
> I ran a test using Vista32 and the following added to bigfiles:
>
> b32to64=: 3 : 0
>   K32#.|:K32&|y
> )
>
> NB. =========================================================
> NB.*bfapplylines a apply verb to lines in file delimited by LF
> NB.
> NB. form:
> NB.   lineproc bfapplylines file     NB. line terminators removed
> NB.   1 lineproc bfapplylines file   NB. line terminators preserved
> bfapplylines=: 1 : 0
>   0 u bfapplylines_jbf_ y
> :
>   y=. 8 u: y
>   s=. bfsize_jbf_ y
>   if. 32=3!:0 s do. return. end.
>   s=.b32to64_jbf_ s
>   p=. 0
>   dat=. ''
>   while. p < s do.
>     b=. 1e6 <. s-p
>     dat=. dat, bixread_jbf_ y;(b64to32_jbf_ p),b
>     p=. p + b
>     if. p = s do.
>       len=. #dat=. dat, LF -. {:dat
>     elseif. (#dat) < len=. 1 + dat i:LF do.
>       'file not in LF-delimited lines' 13!:8[3
>     end.
>     if. x do.
>       u ;.2 len {. dat
>     else.
>       u ;._2 CR -.~ len {. dat
>     end.
>     dat=. len }. dat
>   end.
> )
>
> 1 addc bfapplylines 'e:\5gtst'
>
> This completed in about 7-11 minutes, with c containing the correct total of
> bytes in the file.
> --
> David Mitchell
>
>
> Matthew Brand wrote:
> > I am using 64 bit linux so do not run into any file size issues. It
> > appears that the whole file is read into memory (i.e. swap disk)
> > before any operations are carried out. It might be more efficient to
> > use mapped files.
> >
> > Splitting into many smaller files takes less time because at no point
> > does the program have to use the swap disk. I agree that on a machine
> > with much larger ram it would probably not make a difference.
> >
> > I don't know the details, but I wonder how the unix gawk command
> > manages to trundle through huge data files a line at a time seemingly
> > efficiently, could J do it in a similar way (whatever that is!)?
> >
> >
> >
> > 2009/8/27 R.E. Boss <[email protected]>:
> >> Link should be
> >> http://www.jsoftware.com/jwiki/Scripts/Working%20with%20Big%20Files
> >>
> >>
> >> R.E. Boss
> >>
> >>
> >>> -----Original message-----
> >>> From: [email protected] [mailto:programming-[email protected]] On behalf of Devon McCormick
> >>> Sent: Thursday, 27 August 2009 3:56
> >>> To: Programming forum
> >>> Subject: Re: [Jprogramming] streaming through a large text file
> >>>
> >>> These could be made to work on files >4GB using the bigfiles code (Windows
> >>> only) but they would have to be re-written to do that. You'd have to use
> >>> "bixread" instead of 1!:11 and deal with extended integers - see
> >>> http://www.jsoftware.com/jwiki/Scripts/Working with Big Files for more on
> >>> this if you're interested.
> >>>
> >>> On Wed, Aug 26, 2009 at 6:42 PM, Sherlock, Ric
> >>> <[email protected]> wrote:
> >>>
> >>>> Is the reason that fapplylines & freadblock don't work on files >4GB
> >>>> because a 32bit system can't represent the index into the file as a 32bit
> >>>> integer?
> >>>> In other words they may well work OK on a 64bit system?
> >>>>
> >>>> I think bigfiles.ijs is Windows only? If so it would be an alternative if
> >>>> using a 32bit Windows system, but it sounds like Matthew is on Linux.
> >>>>
> >>>>> From: Don Guinn
> >>>>>
> >>>>> Use bigfiles.ijs
> >>>>>
> >>>>> On Wed, Aug 26, 2009 at 4:09 PM, Devon McCormick wrote:
> >>>>>
> >>>>>> I thought I'd try this code but it doesn't work with very large files
> >>>>>> (>4 GB).
> >>>>>>
> >>>>>> On Wed, Aug 26, 2009 at 11:46 AM, R.E. Boss wrote:
> >>>>>>
> >>>>>>>> Chris Burke wrote:
> >>>>>>>>> Matthew Brand wrote:
> >>>>>>>>> Thanks for the links. I tried the fapplylines adverb but the computer
> >>>>>>>>> grinds along for 30 minutes or so before I pulled the plug. It ends up
> >>>>>>>>> using 10Gb of (mainly virtual) memory. There are 40M lines in my file.
> >>>>>>>>> I will use the unix split command to make lots of little files and
> >>>>>>>>> (myverb fapplylines)&.> fname to solve the problem.
> >>>>>>>> There should be little difference between processing lots of small
> >>>>>>>> files, and one big file in chunks.
> >>>>>>>>
> >>>>>>>> What processing is being done? What result is being accumulated?
> >>>>>>>>
> >>>>>>>> Why not test on a small file first and find out what is taking time -
> >>>>>>>> and only then try on the full file?
> >>>>>>>
> >>>>>>> My guess is we can improve the efficiency of your code by at least a
> >>>>>>> factor 2 (= Hui's constant).
> >>>>>>>
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm
> >>>>
> >>>
> >>>
> >>> --
> >>> Devon McCormick, CFA
> >>> ^me^ at acm.
> >>> org is my
> >>> preferred e-mail

--
Devon McCormick, CFA
^me^ at acm.
org is my
preferred e-mail
