Re: [Pharo-users] running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Paul DeBruicker Fri, 14 Nov 2014 13:09:32 -0800

Hi Sven,

Thanks for taking a look and testing the NeoCSVReader portion for me. 
You're right, of course, that something I'm doing is slow.  But there is
still something I can't figure out.


To provide a little more detail:

When the 'csv reading' process completes successfully, profiling shows that
most of the time is spent in NeoCSVReader>>#peekChar and in using
NeoCSVReader>>#addField: to convert a string to a DateAndTime.  Dropping
the DateAndTime conversion speeds things up but doesn't stop it from running
out of memory.  

I start the image with 

./pharo-ui --memory 1000m myimage.image   

Splitting the CSV file helps:
~1.5MB  5,000 lines = 1.2 seconds.
~15MB   50,000 lines = 8 seconds.
~30MB   100,000 lines = 16 seconds.
~60MB   200,000 lines  = 45 seconds.
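
As a rough sketch, the throughput implied by those timings can be computed in a Playground/Workspace. The numbers are just the line counts and seconds listed above; nothing here is measured separately:

```smalltalk
"Lines per second for each split size: (lines, seconds) pairs from the list above."
#(5000 1.2 50000 8 100000 16 200000 45) pairsDo: [ :lines :seconds |
	Transcript
		show: lines printString;
		show: ' lines: ';
		show: (lines / seconds) rounded printString;
		show: ' lines/s';
		cr ]
```

That works out to roughly 4167, 6250, 6250 and 4444 lines/s, so the 200,000 line file takes 45 seconds where linear scaling from the 100,000 line file would suggest about 32.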
  

It seems that when the CSV file crosses ~70MB in size, performance starts
going haywire, which leads to the out of memory condition.  The
processing never ends.  Sending "kill -SIGUSR1" prints a stack primarily
composed of:

0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class

So it seems like it's trying to signal that it's out of memory after it's
already out of memory, which triggers another OutOfMemory error.  That's why
progress stops.  


** Aside - OutOfMemory should probably be refactored to be able to signal
itself without taking up more memory, triggering itself infinitely.  Maybe
it & its signalling morph infrastructure would be good as a singleton **



I'm confused about why it runs out of memory.  According to htop the image
only takes up about 520-540 MB of RAM when it reaches the 'OutOfMemory'
condition.  This MacBook Air laptop has 4GB of RAM, so there is plenty of room
for the image to grow.  Also, I've specified a 1,000MB image size when starting.  So
it should have plenty of room.  Is there something I should check or a flag
somewhere that prevents it from growing on a Mac?  This is the latest
Pharo30 VM.  


Thanks for helping me get to the bottom of this.

Paul


Sven Van Caekenberghe-2 wrote
> Hi Paul,
> 
> I think you must be doing something wrong with your class, the #do: is
> implemented as streaming over the records one by one, never holding more
> than one in memory.
> 
> This is what I tried:
> 
> 'paul.csv' asFileReference writeStreamDo: [ :file|
>   ZnBufferedWriteStream on: file do: [ :out |
>     (NeoCSVWriter on: out) in: [ :writer |
>       writer writeHeader: { #Number. #Color. #Integer. #Boolean}.
>       1 to: 1e7 do: [ :each |
>         writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom. #(true false) atRandom } ] ] ] ].
> 
> This results in a 300MB file:
> 
> $ ls -lah paul.csv 
> -rw-r--r--@ 1 sven  staff   327M Nov 14 20:45 paul.csv
> $ wc paul.csv 
>  10000001 10000001 342781577 paul.csv
> 
> This is a selective read and collect (loads about 10K records):
> 
> Array streamContents: [ :out |
>   'paul.csv' asFileReference readStreamDo: [ :in |
>     (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
>       reader skipHeader; addIntegerField; addSymbolField; addIntegerField; addFieldConverter: [ :x | x = #true ].
>       reader do: [ :each | each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
> 
> This worked fine on my MacBook Air, no memory problems. It takes a while
> to parse that much data, of course.
> 
> Sven
> 
>> On 14 Nov 2014, at 19:08, Paul DeBruicker <pdebruic@> wrote:
>> 
>> Hi -
>> 
>> I'm processing 9 GB of CSV files (the biggest file is 220MB or so). 
>> I'm not sure if it's because of the size of the files or the code I've
>> written to keep track of the domain objects I'm interested in, but I'm
>> getting out of memory errors & crashes in Pharo 3 on Mac with the latest
>> VM.  I haven't checked other vms.  
>> 
>> I'm going to profile my own code and attempt to split the files manually
>> for now to see what else it could be. 
>> 
>> 
>> Right now I'm doing something similar to
>> 
>>      | file reader |
>>      file := '/path/to/file/myfile.csv' asFileReference readStream.
>>      reader := NeoCSVReader on: file.
>> 
>>      reader
>>              recordClass: MyClass; 
>>              skipHeader;
>>              addField: #myField:;
>>              ....
>>      
>> 
>>      reader do: [ :eachRecord | self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord ].
>>      file close.
>> 
>> 
>> 
>> Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000
>> lines at a time) or an easy way to do that ?
>> 
>> 
>> 
>> 
>> Thanks
>> 
>> Paul





--
View this message in context: 
http://forum.world.st/running-out-of-memory-while-processing-a-220MB-csv-file-with-NeoCSVReader-tips-tp4790264p4790319.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
