I have no experience with files large enough to require 64-bit
addressing (in 1!:11), but I have done a lot of processing of files up
to 2 gigabytes. In the past, I ran some tests indicating that reading
in more than a few thousand "lines" at a time from a text file stream
did not improve overall efficiency. In some cases, making my working
buffer bigger than a few hundred thousand bytes actually increased
processing time (likely because of the cost of moving large chunks of
data around).
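(For reference, the foreigns involved: 1!:1 reads a whole file given a
boxed name, and 1!:11 given name;start,length does an indexed read of
just that block. The file name here is only a placeholder.)
dat =. 1!:1 <'some.log'            NB. read the entire file
blk =. 1!:11 'some.log';0 100000   NB. read only the first 100,000 bytes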
I have found that working through line-oriented files is reasonably
efficient if one is careful. I developed my own simplistic function:
getlines =: 3 : 0
100000 getlines y                 NB. monadic case: default block size is 100,000 bytes
:
bs =. x                           NB. block size in bytes
fs =. 1!:4 < fn =. > 0{y          NB. file name and its size
fl =. bs <. fs - fp =. > 1{y      NB. current position; read at most bs bytes
buf =. 1!:11 fn;fp,fl             NB. indexed read of the next block
if. (fs = fp =. fp + fl) do. fp =. _1 end.   NB. _1 marks end of file
drop =. (<:#buf) - buf i: NL      NB. bytes after the last newline (NL, e.g. LF)
if. ((drop ~: 0) *. fp = _1) do. echo '** Unexpected EOF **' end.
fp =. _1 >. fp - drop             NB. back up so the partial line is reread next call
fn;fp;buf }.~ -drop               NB. return name;position;block trimmed at the last newline
)
I initialize a working buffer -
fbuf =: 'filename';0
and then call getlines in a loop of the form -
whilst. (0 < >1{fbuf) do.
process >2{ fbuf =: getlines fbuf
end.
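As a concrete stand-in for "process", here is a minimal sketch that
totals the number of lines in a file one block at a time; countlines
and total are names I made up for the example, and it assumes the NL
used inside getlines is the linefeed character LF:
countlines =: 3 : 0
NB. complete lines in one block returned by getlines (block ends on a newline)
+/ y = LF
)
total =: 0
fbuf =: 'filename';0
whilst. (0 < >1{fbuf) do.
total =: total + countlines >2{ fbuf =: getlines fbuf
end.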
Mapping files is great, but the greatest benefits to J come when the
data is "rectangular" (which is often the case in databases). I think
there may be some glitches in 64-bit file mapping, but they should
disappear as 64-bit support matures.
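By rectangular I mean something like fixed-width records, which turn
into a character table that J is very good at slicing. A minimal
sketch, with a made-up file name and record length (the same reshape
applies whether the bytes come from 1!:1 or from a mapped file):
reclen =: 80                                     NB. hypothetical record length
dat    =: 1!:1 <'calls.dat'                      NB. hypothetical file
recs   =: ((<. (#dat) % reclen), reclen) $ dat   NB. one record per row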
Tools like gawk, likely written in C, treat line-at-a-time files as
bread and butter. I've found that if a file can be mapped to a
matrix, J can beat many well-polished Unix routines. In one example I
presented at the last J user conference (so long ago!), I showed J
doing some rather complicated aggregations and calculations on
largish phone-call-record files (about 225 megabytes in those cases,
mapped to a large table) faster than Unix wc -l could count the
lines. That got a nice cheer from the audience. :)
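For the flavor of it, the wc -l part is a one-liner once the bytes are
in a J noun (file name again made up):
+/ LF = 1!:1 <'calls.dat'     NB. count linefeeds, i.e. count lines
and once the records are in a table, the aggregations become sums and
selections over its columns.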
- joey
At 11:17 +0100 2009/08/27, Matthew Brand wrote:
>I am using 64-bit Linux so do not run into any file size issues. It
>appears that the whole file is read into memory (i.e. swap disk)
>before any operations are carried out. It might be more efficient to
>use mapped files.
>
>Splitting into many smaller files takes less time because at no point
>does the program have to use the swap disk. I agree that on a machine
>with much larger RAM it would probably not make a difference.
>
>I don't know the details, but I wonder how the Unix gawk command
>manages to trundle through huge data files a line at a time seemingly
>efficiently; could J do it in a similar way (whatever that is!)?
>