I have no experience with files large enough to require 64-bit
addressing (in 1!:11), but I have done a lot of processing of files up
to 2 gigabytes. In the past, I ran some tests indicating that reading
in more than a few thousand "lines" at a time from a text file stream
did not improve overall efficiency. In some cases, making my working
buffer bigger than a few hundred thousand bytes actually increased
processing time (likely because of the cost of moving large chunks of
data around).
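(For reference, the foreigns involved: 1!:1 reads a whole file given a
boxed name, and 1!:11 given name;start,length does an indexed read of
just that block. The file name here is only a placeholder.)
dat =. 1!:1 <'some.log'            NB. read the entire file
blk =. 1!:11 'some.log';0 100000   NB. read only the first 100,000 bytes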
I have found that working through line-oriented files is reasonably
efficient if one is careful. I developed my own simplistic function:
getlines =: 3 : 0
100000 getlines y                 NB. monadic case: default block size is 100,000 bytes
:
bs =. x                           NB. block size in bytes
fs =. 1!:4 < fn =. > 0{y          NB. file name and its size
fl =. bs <. fs - fp =. > 1{y      NB. current position; read at most bs bytes
buf =. 1!:11 fn;fp,fl             NB. indexed read of the next block
if. (fs = fp =. fp + fl) do. fp =. _1 end.   NB. _1 marks end of file
drop =. (<:#buf) - buf i: NL      NB. bytes after the last newline (NL, e.g. LF)
if. ((drop ~: 0) *. fp = _1) do. echo '** Unexpected EOF **' end.
fp =. _1 >. fp - drop             NB. back up so the partial line is reread next call
fn;fp;buf }.~ -drop               NB. return name;position;block trimmed at the last newline
)
I initialize a working buffer -
fbuf =: 'filename';0
and then call getlines in a loop of the form -
whilst. (0 < >1{fbuf) do.
process >2{ fbuf =: getlines fbuf
end.
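As a concrete stand-in for "process", here is a minimal sketch that
totals the number of lines in a file one block at a time; countlines
and total are names I made up for the example, and it assumes the NL
used inside getlines is the linefeed character LF:
countlines =: 3 : 0
NB. complete lines in one block returned by getlines (block ends on a newline)
+/ y = LF
)
total =: 0
fbuf =: 'filename';0
whilst. (0 < >1{fbuf) do.
total =: total + countlines >2{ fbuf =: getlines fbuf
end.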
Mapping files is great, but the greatest benefits to J come when the
data is "rectangular" (which is often the case in databases). I think
there may be some glitches in 64-bit file mapping, but they should
disappear as 64-bit support matures.
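By rectangular I mean something like fixed-width records, which turn
into a character table that J is very good at slicing. A minimal
sketch, with a made-up file name and record length (the same reshape
applies whether the bytes come from 1!:1 or from a mapped file):
reclen =: 80                                     NB. hypothetical record length
dat    =: 1!:1 <'calls.dat'                      NB. hypothetical file
recs   =: ((<. (#dat) % reclen), reclen) $ dat   NB. one record per row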
Tools like gawk, likely written in C, treat line-at-a-time files as
bread and butter. I've found that if a file can be mapped to a
matrix, J can beat many well-polished Unix routines. In one example I
presented at the last J user conference (so long ago!), I showed J
doing some rather complicated aggregations and calculations on
largish phone-call-record files (about 225 megabytes in those cases,
mapped to a large table) faster than Unix wc -l could count the
lines. That got a nice cheer from the audience. :)
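For the flavor of it, the wc -l part is a one-liner once the bytes are
in a J noun (file name again made up):
+/ LF = 1!:1 <'calls.dat'     NB. count linefeeds, i.e. count lines
and once the records are in a table, the aggregations become sums and
selections over its columns.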
- joey
At 11:17 +0100 2009/08/27, Matthew Brand wrote:
>I am using 64-bit Linux so do not run into any file size issues. It
>appears that the whole file is read into memory (i.e. swap disk)
>before any operations are carried out. It might be more efficient to
>use mapped files.
>
>Splitting into many smaller files takes less time because at no point
>does the program have to use the swap disk. I agree that on a machine
>with much larger RAM it would probably not make a difference.
>
>I don't know the details, but I wonder how the Unix gawk command
>manages to trundle through huge data files a line at a time seemingly
>efficiently; could J do it in a similar way (whatever that is!)?
>