On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the logs prior to
> running them through an analysis tool.  I'm running into "Out of Memory"
> errors and the odd Rebol crash in attempting to do this.
>
> I started out simply reading the data directly into a word and looping
> through it.  This worked great for the sample data set of 45 MB, but then
> failed on a 430+ MB file, i.e. data: read/lines %file-name.log
>
> I then changed the direct read to use a port, i.e. data-port: open/lines
> %file-name.log.  This worked for the 430+ MB file, but then I started
> getting the errors again for the 800+ MB files.
>
> It's now obvious that I will need to read in portions of the file at a
> time.  However, I am unsure how to do this while also ensuring I get all
> the data.  As you can see from my earlier example code, I'm interested in
> reading a line at a time for simplicity in processing the records, as
> they are not fixed width (they vary in length).  My fear is that I will
> not be able to properly handle the records that are truncated at the
> boundary of the data block I retrieve from the file, or at least not be
> able to do this easily.  Are there any suggestions?
>
> My guess is that I will need to:
> - pull in a fixed-length block of data
> - read through the data until I reach the first occurrence of a newline,
>   tracking the index of the newline's location
> - continue reading until I reach the end of the data block
> - once I reach the end of the data retrieved, calculate where the last
>   processed record ended
> - read the next data block from that point
> - continue until reaching the end of the file
>
> Any other suggestions?
>
> Regards,
> Brock Kalef


Sounds like a plan to me. I just ran this on a 1.9 GB file and it was
surprisingly fast (kept my HD busy for sure):

port: open/seek %/c/apache.log
chunksize: 1'048'576  ; 1 MB chunks
forskip port chunksize [
    ; the final chunk may be shorter than chunksize;
    ; copy/part simply returns whatever is left
    chunk: copy/part port chunksize
]
close port
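As for the truncated-record worry: the usual trick is to keep the partial line at the end of each chunk and prepend it to the next chunk, so you only ever process complete lines. Here's that carry-over logic sketched in Python, purely for illustration (the function name and chunk size are my own; the same pattern applies to the Rebol chunks above):

```python
import io

def iter_lines_chunked(f, chunksize=1_048_576):
    """Yield complete lines from a binary file object, reading
    fixed-size chunks.  Any partial line at the end of a chunk is
    carried over and prepended to the next chunk, so records are
    never truncated at a chunk boundary."""
    carry = b""
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            if carry:
                yield carry  # final line with no trailing newline
            return
        chunk = carry + chunk
        lines = chunk.split(b"\n")
        carry = lines.pop()  # last element is a partial line (or empty)
        for line in lines:
            yield line

# small demonstration with an in-memory "file" and a tiny chunk size
data = b"alpha\nbeta\ngamma\ndelta"
lines = list(iter_lines_chunked(io.BytesIO(data), chunksize=7))
# lines is [b"alpha", b"beta", b"gamma", b"delta"]
```

Because each chunk is split only on newlines actually present in it, no index arithmetic is needed: the carry variable is the "calculate where the last record ended" step from your plan.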

Do you really need to process it line by line, though? That would really
slow it down. Are you sure you cannot operate on the chunks in their
entirety somehow?

Cheers,
Kai