When I wrote earlier about using "extended integers" to address beyond 2^32,
it occurred to me that this is what "bigfiles" does _not_ do: it uses a pair
of (signed) integers because that's what's needed by the Windows API.  I was
inspired by this notion to modify my old code to actually use J extended
integers externally as this is much neater than faking high numbers with a
pair of integers.
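
As a rough illustration of the difference (a sketch only: the names K32, hilo and big are made up for this note and are not the definitions in bigfiles, and the offset is just an example value), a pair of 32-bit words can be folded into a single extended integer and split back out again:

```j
NB. Illustrative sketch; K32, hilo and big are assumed names, not bigfiles code.
K32=: 2x^32                 NB. 2^32 held as an extended integer
hilo=: 1 705032704          NB. example (high, low) words for a 5e9-byte offset
big=: K32 #. K32 | hilo     NB. residue turns a negative (signed) word into its unsigned value
big                         NB. 5000000000
K32 #. K32 | 0 _1           NB. a low word of _1 stands for 4294967295
(0, K32) #: big             NB. split back into a (high, low) pair of unsigned words
```

The residue step matters because the API hands back signed words, so any low word at or above 2^31 arrives as a negative number.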

When I revamped my old code, I noticed that my special "self-numbered" test
file seems to have a coincidentally high proportion of integral powers of
ten, so I address this in the new version of the essay at
http://www.jsoftware.com/jwiki/Scripts/Working%20with%20Big%20Files

On Thu, Aug 27, 2009 at 9:16 PM, David Mitchell <[email protected]> wrote:

> I created a 5 gigabyte file with 40e6 lines of length 100-150:
>
> NB. Test data creation
>
> require 'files'
> require 'bigfiles'
>
> test5g=: 3 : 0
> qq=.100+?(40e6#50)
> '' fappend 'e:\5gtst'
> for_i. i. 40e6%1000 do.
>  (;(CRLF,~])&.>(1000{.(i*1000)}.qq){.&.><'This is a test of a long and similar string.') bappend_jbf_ 'e:\5gtst'
> end.
> )
>
> I ran a test using OpenSUSE64:
>
> addc=: 3 : 0
> c=:c+#y
> )
>
> c=:0
>
> 1 addc fapplylines '/windows/e/5gtst'
>
> This completed in about 7-11 minutes, with c containing the correct total
> of bytes in the file.
>
> I ran a test using Vista32 and the following added to bigfiles:
>
> b32to64=: 3 : 0
> K32#.|:K32&|y
> )
>
> NB. =========================================================
> NB.*bfapplylines a apply verb to lines in file delimited by LF
> NB.
> NB. form:
> NB.     lineproc bfapplylines file  NB. line terminators removed
> NB.   1 lineproc bfapplylines file  NB. line terminators preserved
> bfapplylines=: 1 : 0
> 0 u bfapplylines_jbf_ y
> :
> y=. 8 u: y
> s=. bfsize_jbf_ y
> if. 32=3!:0 s do. return. end.
> s=.b32to64_jbf_ s
> p=. 0
> dat=. ''
> while. p < s do.
>   b=. 1e6 <. s-p
>   dat=. dat, bixread_jbf_ y;(b64to32_jbf_ p),b
>   p=. p + b
>   if. p = s do.
>     len=. #dat=. dat, LF -. {:dat
>   elseif. (#dat) < len=. 1 + dat i:LF do.
>     'file not in LF-delimited lines' 13!:8[3
>   end.
>   if. x do.
>     u ;.2 len {. dat
>   else.
>     u ;._2 CR -.~ len {. dat
>   end.
>   dat=. len }. dat
> end.
> )
>
> 1 addc bfapplylines 'e:\5gtst'
>
> This completed in about 7-11 minutes, with c containing the correct total
> of bytes in the file.
> --
> David Mitchell
>
>
> Matthew Brand wrote:
> > I am using 64 bit linux so do not run into any file size issues. It
> > appears that the whole file is read into memory (i.e. swap disk)
> > before any operations are carried out. It might be more efficient to
> > use mapped files.
> >
> > Splitting into many smaller files takes less time because at no point
> > does the program have to use the swap disk. I agree that on a machine
> > with much larger ram it would probably not make a difference.
> >
> > I don't know the details, but I wonder how the unix gawk command
> > manages to trundle through huge data files a line at a time, seemingly
> > efficiently. Could J do it in a similar way (whatever that is!)?
> >
> >
> >
> > 2009/8/27 R.E. Boss <[email protected]>:
> >> Link should be
> >> http://www.jsoftware.com/jwiki/Scripts/Working%20with%20Big%20Files
> >>
> >>
> >> R.E. Boss
> >>
> >>
> >>> -----Original message-----
> >>> From: [email protected] [mailto:programming-
> >>> [email protected]] On behalf of Devon McCormick
> >>> Sent: Thursday 27 August 2009 3:56
> >>> To: Programming forum
> >>> Subject: Re: [Jprogramming] streaming through a large text file
> >>>
> >>> These could be made to work on files >4GB using the bigfiles code
> >>> (Windows only) but they would have to be re-written to do that.  You'd
> >>> have to use "bixread" instead of 1!:11 and deal with extended integers -
> >>> see http://www.jsoftware.com/jwiki/Scripts/Working with Big Files for
> >>> more on this if you're interested.
> >>>
> >>> On Wed, Aug 26, 2009 at 6:42 PM, Sherlock, Ric
> >>> <[email protected]>wrote:
> >>>
> >>>> Is the reason that fapplylines & freadblock don't work on files >4GB
> >>>> that a 32bit system can't represent the index into the file as a
> >>>> 32bit integer?
> >>>> In other words they may well work OK on a 64bit system?
> >>>>
> >>>> I think bigfiles.ijs is Windows only? If so it would be an alternative
> >>>> if using a 32bit Windows system, but it sounds like Matthew is on Linux.
> >>>>
> >>>>> From: Don Guinn
> >>>>>
> >>>>> Use bigfiles.ijs
> >>>>>
> >>>>> On Wed, Aug 26, 2009 at 4:09 PM, Devon McCormick wrote:
> >>>>>
> >>>>>> I thought I'd try this code but it doesn't work with very large
> >>>>>> files (>4 GB).
> >>>>>>
> >>>>>> On Wed, Aug 26, 2009 at 11:46 AM, R.E. Boss wrote:
> >>>>>>
> >>>>>>>> Chris Burke wrote:
> >>>>>>>>> Matthew Brand wrote:
> >>>>>>>>> Thanks for the links. I tried the fapplylines adverb but the
> >>>>>>>>> computer grinds along for 30 minutes or so before I pulled the
> >>>>>>>>> plug. It ends up using 10Gb of (mainly virtual) memory. There are
> >>>>>>>>> 40M lines in my file.
> >>>>>>>>> I will use the unix split command to make lots of little files and
> >>>>>>>>> (myverb fapplylines)&.> fname to solve the problem.
> >>>>>>>> There should be little difference between processing lots of
> >>>>>>>> small files, and one big file in chunks.
> >>>>>>>>
> >>>>>>>> What processing is being done? What result is being accumulated?
> >>>>>>>>
> >>>>>>>> Why not test on a small file first and find out what is taking
> >>>>>>>> time - and only then try on the full file?
> >>>>>>>
> >>>>>>> My guess is we can improve the efficiency of your code by at least
> >>>>>>> a factor 2 (= Hui's constant).
> >>>>>>>
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm
> >>>>
> >>>
> >>>
> >>> --
> >>> Devon McCormick, CFA
> >>> ^me^ at acm.
> >>> org is my
> >>> preferred e-mail



-- 
Devon McCormick, CFA
^me^ at acm.
org is my
preferred e-mail
