I created a 5 gigabyte file with 40e6 lines of length 100-150:

NB. Test data creation

require 'files'
require 'bigfiles'

test5g=: 3 : 0
qq=.100+?(40e6#50)
'' fappend 'e:\5gtst'
for_i. i. 40e6%1000 do.
  (;(CRLF,~])&.>(1000{.(i*1000)}.qq){.&.><'This is a test of a long and similar string.') bappend_jbf_ 'e:\5gtst'
end.
)

I ran a test using OpenSUSE64:

addc=: 3 : 0
c=:c+#y
)

c=:0

1 addc fapplylines '/windows/e/5gtst'

This completed in about 7-11 minutes, with c containing the correct total of 
bytes in the file.

I ran a test using Vista32 with the following added to bigfiles:

b32to64=: 3 : 0
K32#.|:K32&|y
)
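For readers unfamiliar with the base-conversion idiom, here is a rough Python equivalent of b32to64 (my own sketch, not from bigfiles.ijs; the parameter name `words` is mine): `#.` decodes base-2^32 digits with the high word first, and `K32&|` maps negative signed 32-bit values to their unsigned equivalents.

```python
# Rough Python equivalent of the J verb b32to64 (illustrative sketch,
# not from bigfiles.ijs): combine 32-bit words into one integer.
K32 = 2 ** 32

def b32to64(words):
    # K32&| in J: mod 2^32 maps negative signed 32-bit values to their
    # unsigned equivalents; #. then decodes base-2^32 digits, high word first.
    result = 0
    for w in words:
        result = result * K32 + (w % K32)
    return result
```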

NB. =========================================================
NB.*bfapplylines a  apply verb to lines in a file delimited by LF
NB.
NB. form:
NB.     lineproc bfapplylines file  NB. line terminators removed
NB.   1 lineproc bfapplylines file  NB. line terminators preserved
bfapplylines=: 1 : 0
0 u bfapplylines_jbf_ y
:
y=. 8 u: y
s=. bfsize_jbf_ y
if. 32=3!:0 s do. return. end.
s=.b32to64_jbf_ s
p=. 0
dat=. ''
while. p < s do.
   b=. 1e6 <. s-p
   dat=. dat, bixread_jbf_ y;(b64to32_jbf_ p),b
   p=. p + b
   if. p = s do.
     len=. #dat=. dat, LF -. {:dat
   elseif. (#dat) < len=. 1 + dat i: LF do.
     'file not in LF-delimited lines' 13!:8[3
   end.
   if. x do.
     u ;.2 len {. dat
   else.
     u ;._2 CR -.~ len {. dat
   end.
   dat=. len }. dat
end.
)
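The chunking pattern above carries over to other languages directly. As a cross-check on the logic, here is the same idea sketched in Python (my own illustration of the technique, not part of bigfiles): read fixed-size blocks, apply the line verb only to complete LF-terminated lines, and carry the tail of each block into the next, mirroring the x=0 case where terminators are removed.

```python
# Illustrative Python sketch of the bfapplylines chunking pattern:
# stream a large file in fixed-size blocks, processing only complete
# LF-terminated lines and carrying the remainder forward.
def apply_lines(path, lineproc, blocksize=1_000_000):
    carry = b''
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:                   # end of file
                if carry:                   # final line lacked a trailing LF
                    lineproc(carry)
                return
            data = carry + block
            cut = data.rfind(b'\n') + 1     # end of last complete line
            for line in data[:cut].split(b'\n')[:-1]:
                lineproc(line.rstrip(b'\r'))  # drop CR for CRLF files
            carry = data[cut:]
```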

1 addc bfapplylines 'e:\5gtst'

This completed in about 7-11 minutes, with c containing the correct total of 
bytes in the file.
--
David Mitchell


Matthew Brand wrote:
> I am using 64 bit linux so do not run into any file size issues. It
> appears that the whole file is read into memory (i.e. swap disk)
> before any operations are carried out. It might be more efficient to
> use mapped files.
> 
> Splitting into many smaller files takes less time because at no point
> does the program have to use the swap disk. I agree that on a machine
> with much larger ram it would probably not make a difference.
> 
> I don't know the details, but I wonder how the unix gawk command
> manages to trundle through huge data files a line at a time seemingly
> efficiently. Could J do it in a similar way (whatever that is!)?
> 
> 
> 
> 2009/8/27 R.E. Boss <[email protected]>:
>> Link should be
>> http://www.jsoftware.com/jwiki/Scripts/Working%20with%20Big%20Files
>>
>>
>> R.E. Boss
>>
>>
>>> -----Original message-----
>>> From: [email protected] [mailto:programming-
>>> [email protected]] On behalf of Devon McCormick
>>> Sent: Thursday 27 August 2009 3:56
>>> To: Programming forum
>>> Subject: Re: [Jprogramming] streaming through a large text file
>>>
>>> These could be made to work on files >4GB using the bigfiles code (Windows
>>> only) but they would have to be re-written to do that.  You'd have to use
>>> "bixread" instead of 1!:11 and deal with extended integers - see
>>> http://www.jsoftware.com/jwiki/Scripts/Working with Big Files for more on
>>> this if you're interested.
>>>
>>> On Wed, Aug 26, 2009 at 6:42 PM, Sherlock, Ric
>>> <[email protected]>wrote:
>>>
>>>> Is the reason that fapplylines & freadblock don't work on files >4GB
>>>> because a 32bit system can't represent the index into the file as a
>>>> 32bit integer?
>>>> In other words they may well work OK on a 64bit system?
>>>>
>>>> I think bigfiles.ijs is Windows only? If so it would be an alternative
>>>> if using a 32bit Windows system, but it sounds like Matthew is on Linux.
>>>>
>>>>> From: Don Guinn
>>>>>
>>>>> Use bigfiles.ijs
>>>>>
>>>>> On Wed, Aug 26, 2009 at 4:09 PM, Devon McCormick wrote:
>>>>>
>>>>>> I thought I'd try this code but it doesn't work with very large
>>>>>> files (>4 GB).
>>>>>>
>>>>>> On Wed, Aug 26, 2009 at 11:46 AM, R.E. Boss wrote:
>>>>>>
>>>>>>>> Chris Burke wrote:
>>>>>>>>> Matthew Brand wrote:
>>>>>>>>> Thanks for the links. I tried the fapplylines adverb but the
>>>>>>>>> computer grinds along for 30 minutes or so before I pulled the
>>>>>>>>> plug. It ends up using 10Gb of (mainly virtual) memory. There
>>>>>>>>> are 40M lines in my file.
>>>>>>>>> I will use the unix split command to make lots of little files
>>>>>>>>> and (myverb fapplylines)&.> fname to solve the problem.
>>>>>>>> There should be little difference between processing lots of
>>>>>>>> small files, and one big file in chunks.
>>>>>>>>
>>>>>>>> What processing is being done? What result is being accumulated?
>>>>>>>>
>>>>>>>> Why not test on a small file first and find out what is taking
>>>>>>>> time - and only then try on the full file?
>>>>>>>
>>>>>>> My guess is we can improve the efficiency of your code by at
>>>>>>> least a factor 2 (= Hui's constant).
>>>>>>>
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>>
>>>
>>>
>>> --
>>> Devon McCormick, CFA
>>> ^me^ at acm.
>>> org is my
>>> preferred e-mail