Kai,
Yes, I'm going to need to use the /seek option.  I was trying to avoid
it, but it looks like it's the only way to go.

The records that I am working with, although not fixed width, are tab
delimited.  I could likely come up with a way to work on a fixed record
size using skip etc., but I think it may be just as easy to check
whether the last character of the block is a #"^/" and, if not, ignore
that truncated record and start the next block at the beginning of it.
I should be able to do that easily enough using index?.  I've been
playing with it a little and it looks very feasible to implement with
minimal pain.  Whether it will slow things down or not isn't too big a
concern.
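For what it's worth, here is a rough sketch of that idea, building on
your /seek loop (REBOL 2 assumed; process-line is just a hypothetical
stand-in for whatever I end up doing with each record):

    port: open/seek %file-name.log
    chunksize: 1'048'576  ; 1 MB per read
    while [not empty? chunk: copy/part port chunksize] [
        used: length? chunk
        if #"^/" <> last chunk [
            ; chunk ends mid-record: keep only the complete lines and
            ; re-read the truncated record as part of the next chunk
            if nl: find/last chunk #"^/" [
                used: index? nl
                chunk: copy/part chunk used
            ]
        ]
        foreach line parse/all chunk "^/" [process-line line]
        port: skip port used  ; advance past the bytes actually consumed
    ]
    close port

If no newline turns up in a whole chunk (a record larger than chunksize),
this just passes the chunk through as-is, which should never happen with
my log data anyway.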

Cheers, and thanks for your reply.

Brock

-----Original Message-----
From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf
Of Kai Peters
Sent: August 11, 2008 7:59 PM
To: Brock Kalef
Subject: [REBOL] Re: Working with large files


On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the log prior to
> running through an analysis tool.  I'm running into "Out of Memory"
> errors and the odd REBOL crash in attempting to do this.
>
> I started out simply reading the data directly into a word and looping
> through the data.  This worked great for the sample data set of 45 MB,
> but then failed on a 430+ MB file, i.e. data: read/lines %file-name.log
>
> I then changed the direct read to use a port, i.e. data-port:
> open/lines %file-name.log.  This worked for the 430+ MB file, but then
> I started getting the errors again for the 800+ MB files.
>
> It's now obvious that I will need to read in portions of the file at a
> time.  However, I am unsure how to do this while also ensuring I get
> all the data.  As you can see from my earlier example code, I'm
> interested in reading a line at a time for simplicity in processing
> the records, as they are not fixed width (they vary in length).  My
> fear is that I will not be able to properly handle the records that
> are truncated due to the size of the data block I retrieve from the
> file, or at least not be able to do this easily.  Are there any
> suggestions?
>
> My guess is that I will need to:
> -  pull in a fixed-length block of data
> -  read through the data until I reach the first occurrence of a
>    newline, and track the index of the newline's location
> -  continue reading until I reach the end of the data block
> -  once at the end of the data retrieved, calculate where the last
>    record processed ended
> -  read the next data block from that point
> -  continue until reaching the end of the file
>
> Any other suggestions?
>
> Regards,
> Brock Kalef


Sounds like a plan to me. Just ran this on a 1.9 GB file and it was
surprisingly fast (kept my HD busy for sure):

port: open/seek %/c/apache.log
chunksize: 1'048'576  ; 1 MB chunks
forskip port chunksize [
  chunk: copy/part port chunksize
]
close port

Do you really need to process it line by line though? That would really
slow it down.
Sure you cannot operate on the chunks in their entirety somehow?

Cheers,
Kai
--
To unsubscribe from the list, just send an email to lists at rebol.com
with unsubscribe as the subject.
