Kai,

Yes, I'm going to need to use the /seek option. I was trying to avoid it, but it looks like it is the only way to go.
The records that I am working with, although not fixed width, are tab delimited. I could likely come up with a way to work on a fixed record size using skip etc., but I think it may be just as easy to check whether the last character of the block is a #"^/" and, if not, ignore that record and start the next block at the start of this record. I should be able to do that easily enough using index?. I've been playing with it a little and it looks very feasible to implement with minimal pain. Whether it will slow things down isn't too big a concern.

Cheers, and thanks for your reply.
Brock

-----Original Message-----
From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Kai Peters
Sent: August 11, 2008 7:59 PM
To: Brock Kalef
Subject: [REBOL] Re: Working with large files

On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:

> I'm looking to read 800+ MB web log files and process the log prior to
> running through an analysis tool. I'm running into "Out of Memory"
> errors and the odd REBOL crash in attempting to do this.
>
> I started out simply reading the data directly into a word and looping
> through the data. This worked great for the sample data set of 45 MB,
> but then failed on a 430+ MB file, i.e.:
>
>     data: read/lines %file-name.log
>
> I then changed the direct read to use a port, i.e.:
>
>     data-port: open/lines %file-name.log
>
> This worked for the 430+ MB file, but then I started getting the
> errors again for the 800+ MB files.
>
> It's now obvious that I will need to read in portions of the file at a
> time. However, I am unsure how to do this while also ensuring I get
> all the data. As you can see from my earlier example code, I'm
> interested in reading a line at a time for simplicity in processing
> the records, as they are not fixed width (they vary in length).
> My fear is that I will not be able to properly handle the records that
> are truncated due to the size of the data block I retrieve from the
> file, or at least not be able to do this easily. Are there any
> suggestions?
>
> My guess is that I will need to:
>
> - pull in a fixed-length block of data
> - read the data until I reach the first occurrence of a newline, and
>   track the index of its location
> - continue reading the data until I reach the end of the data block
> - once reaching the end of the data retrieved, calculate where the
>   last complete record ended
> - read the next data block from that point
> - continue until reaching the end of the file
>
> Any other suggestions?
>
> Regards,
> Brock Kalef

Sounds like a plan to me. Just ran this on a 1.9 GB file and it was surprisingly fast (kept my HD busy for sure):

    port: open/seek %/c/apache.log
    chunksize: 1'048'576  ; 1 MB chunks
    forskip port chunksize [
        chunk: copy/part port chunksize
    ]
    close port

Do you really need to process it line by line, though? That would really slow it down. Are you sure you cannot operate on the chunks in their entirety somehow?

Cheers,
Kai

--
To unsubscribe from the list, just send an email to lists at rebol.com with unsubscribe as the subject.
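The chunk-boundary scheme discussed in this thread — read a fixed-size block, process only the complete lines, and carry the truncated final record over into the next block — can be sketched as follows. This is an illustration in Python rather than the thread's REBOL, and the function name, file path, and chunk size are placeholders, not anything from the original messages:

```python
def iter_lines_chunked(path, chunk_size=1_048_576):
    """Yield complete lines from a large file, reading it in
    fixed-size chunks and carrying any truncated final record
    over into the next chunk."""
    carry = b""  # partial record left over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            block = carry + chunk
            # If the block does not end on a newline, the last record
            # is truncated: split it off and keep it for the next pass.
            last_nl = block.rfind(b"\n")
            if last_nl == -1:
                carry = block  # no complete line in this block yet
                continue
            carry = block[last_nl + 1:]
            for line in block[:last_nl].split(b"\n"):
                yield line
    if carry:  # final record with no trailing newline
        yield carry
```

Each tab-delimited record can then be split with `line.split(b"\t")` as it is yielded, so only one chunk (plus at most one partial record) is held in memory at a time.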
