Re: [Pharo-users] running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Alain Rastoul Sat, 15 Nov 2014 16:48:00 -0800

Ah, this reminded me of an old thread about memory on Windows and why the Windows setting was 512 by default:

http://lists.pharo.org/pipermail/pharo-dev_lists.pharo.org/2011-April/047594.html

And about VM options, it also reminded me that on Windows the option had a trailing ':' that didn't exist on Mac, IIRC (and no double '-' on Mac, I think):
-memory: 1024 against -memory 1024

It may be a silly suggestion, but perhaps you could try it?




On 16/11/2014 01:08, Paul DeBruicker wrote:

Hi Sven,

I think you are right that I am mis-estimating how big these objects are,
especially lots of them.  But I still think there's another problem with the
image not growing to the machine or VM limits.

Eliot Miranda created a script to see how big a heap (using Spur) could grow
on different platforms (the email is here:
http://forum.world.st/New-Cog-VMs-available-td4764823.html#a4764840).  I've
adapted it ever so slightly to run on Pharo 3:

| them |
them := OrderedCollection new.
[[them addLast: (ByteArray new: 16000000).
  Transcript cr; show: ((Smalltalk vm parameterAt: 3) / (1024*1024.0) printShowingDecimalPlaces: 1); flush] repeat]
    on: OutOfMemory
    do: [:ex| 2 to: them size by: 2 do: [:i| them at: i put: nil. Smalltalk garbageCollect]].
Transcript cr; show: ((Smalltalk vm parameterAt: 3) / (1024*1024.0) printShowingDecimalPlaces: 1); flush.
them := nil.
Smalltalk garbageCollect.
Transcript cr; show: ((Smalltalk vm parameterAt: 3) / (1024*1024.0) printShowingDecimalPlaces: 1); flush

When I run it in Pharo 3 (& Pharo 1.4) on my laptop this is the output:


49.4
64.6
79.9
95.2
110.4
125.7
140.9
156.2
171.5
186.7
202.0
217.2
232.5
247.8
263.0
278.3
293.5
308.8
324.1
339.3
354.6
369.8
385.1
400.3
415.6
430.9
446.1
461.4
476.6
491.9
507.2
278.3
34.1

It shows that the heap grows to about 500MB and then an OutOfMemory error is
thrown.


It is my intuition that this test would make the image grow to either the
limit of the machine or the limit of the VM, whichever came first.

Is there a setting I need to change to make the image grow to, say, 1GB for
this test?

Starting the image with the '--memory 1000m' command line argument doesn't
change the test result.


Also - that weird stack with 'output file stack is full' was a result of
running the MessageTally profiler and hitting the issue that John McIntosh
described here:
http://forum.world.st/Squeak-hang-at-full-cpu-help-tp3006008p3007628.html


And for my immediate needs of processing the CSV files I ported everything
to GemStone and am all set in that regard.






Sven Van Caekenberghe-2 wrote

On 14 Nov 2014, at 23:14, Paul DeBruicker <pdebruic@> wrote:


Hi Sven

Yes, as I said earlier, after your first email, I think it's not so much
a problem with NeoCSV as with what I'm doing and an out of memory
condition.

Have you ever seen a stack after sending kill -SIGUSR1 that looks like
this:

output file stack is full.
output file stack is full.
output file stack is full.
output file stack is full.
output file stack is full.
....


What does that mean?


I don't know, but I think that you are really out of memory.
BTW, I think that setting no flags is better, memory will expand maximally
then.
I think the useful maximum is closer to 1GB than 2GB.

Answers to your questions below.


It is difficult to follow what you are doing exactly, but I think that you
underestimate how much memory a parsed, structured/nested object uses.
Taking the second line of your example, the 20+ fields, with 3
DateAndTimes, easily cost between 512 and 1024 bytes per record. That
would limit you to between 1M and 2M records.
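
As a rough check of that estimate, the arithmetic can be run in a workspace. This is only a sketch: the 1 GB usable heap is an assumption (taken from the "closer to 1GB than 2GB" remark above), and the per-record sizes are the 512 and 1024 byte figures quoted, not measurements.

```smalltalk
| heapBytes minRecords maxRecords |
"Assumption: about 1 GB of heap is actually usable for records."
heapBytes := 1024 * 1024 * 1024.
"Estimated cost per parsed record: between 512 and 1024 bytes."
minRecords := heapBytes // 1024.
maxRecords := heapBytes // 512.
Transcript cr;
    show: minRecords printString;
    show: ' to ';
    show: maxRecords printString;
    show: ' records';
    flush.
```

This gives roughly 1 million to 2 million records. If the heap stops growing at about 500 MB, as in the heap-growth test reported in this thread, the same arithmetic gives only about half that.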

I tried this:

Array streamContents: [ :data |
        5e2 timesRepeat: [
                data nextPut: (Array streamContents: [ :out |
                        20 timesRepeat: [ out nextPut: Character alphabet ].
                        3 timesRepeat: [ out nextPut: DateAndTime now ] ]) ] ].

it worked to 5e5, but not for 5e6 - I didn't try numbers in between as it
takes very long.

Good luck, if you can solve this, please tell us how you did it.

Thanks again for helping me out



Sven Van Caekenberghe-2 wrote

OK then, you *can* read/process 300MB .csv files ;-)

What does your CSV file look like, can you show a couple of lines ?

here are 2 lines + a header:

"provnum","Provname","address","city","state","zip","survey_date_output","SurveyType","defpref","tag","tag_desc","scope","defstat","statdate","cycle","standard","complaint","filedate"
"015009","BURNS NURSING HOME, INC.","701 MONROE STREET
NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0314","Give
residents proper treatment to prevent new bed (pressure) sores or heal
existing bed sores.","D","Deficient, Provider has date of
correction","2013-10-10",1,"Y","N","2014-01-01"
"015009","BURNS NURSING HOME, INC.","701 MONROE STREET
NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0315","Ensure
that each resident who enters the nursing home without a catheter is not
given a catheter, unless medically necessary, and that incontinent
patients receive proper services to prevent urinary tract infections and
restore normal bladder functions.","D","Deficient, Provider has date of
correction","2013-10-10",1,"Y","N","2014-01-01"

You are using a custom record class of your own, what does that look
like or do ?

A custom record class. This is all publicly available data, but I'm
keeping track of the performance of US-based health care providers
during their annual inspections. So the records are notes of a
deficiency during the inspection, and I'm keeping those notes in a
collection in an instance of the health care provider's class. The
custom record class just converts the CSV record to objects (Integers,
Strings, DateAndTime) and then gets stuffed in the health care
provider's deficiency history OrderedCollection (which has about 100
items). Again, I don't think it's what I'm doing so much as that the
image isn't growing when it needs to.

Maybe you can try using Array again ?

I've attempted to do it where I parse and convert the entire CSV into
domain objects then add them to the image and the parsing works fine,
but the system runs out of resources during the update phase.

What percentage of records read do you keep ? In my example it was very
small. Have you tried calculating your memory usage ?

I'm keeping some data from every record, but it doesn't load more than
500MB of the data before falling over. I am not attempting to load the
9GB of CSV files into one image. For 95% of the records in the CSV file,
20 of the 22 columns of the data are the same from file to file; just a
'published date' and a 'time to expiration' date change. Each file
covers a month, with about 500k deficiencies. Each month some
deficiencies are added to the file and some are resolved. So the total
number of deficiencies in the image is about 500k. Of those records that
don't expire in a given month, I'm adding the published date to a
collection of published dates for the record and also adding the "time
to expiration" to a collection of those to record what was made public,
and letting the rest of the data get GC'd. I don't only load those two
records because the other fields of the record in the CSV could change.

I have not calculated the memory usage for the collection because I
thought it would have no problem fitting in the 2GB of RAM I have on
this machine.

On 14 Nov 2014, at 22:34, Paul DeBruicker <pdebruic@> wrote:


Yes. With the image & VM I'm having trouble with, I get an array with
9,942 elements in it.  So it works as you'd expect.

While processing the CSV file the image stays at about 60MB in RAM.









Sven Van Caekenberghe-2 wrote

Can you successfully run my example code ?

On 14 Nov 2014, at 22:03, Paul DeBruicker <pdebruic@> wrote:


Hi Sven,

Thanks for taking a look and testing the NeoCSVReader portion for me.
You're right of course that there's something I'm doing that's slow. But
there is something I can't figure out yet.

To provide a little more detail:

When the 'csv reading' process completes successfully, profiling shows
that most of the time is spent in NeoCSVReader>>#peekChar and using
NeoCSVReader>>#addField: to convert a string to a DateAndTime. Dropping
the DateAndTime conversion speeds things up but doesn't stop it from
running out of memory.

I start the image with

./pharo-ui --memory 1000m myimage.image

Splitting the CSV file helps:
~1.5MB  5,000 lines = 1.2 seconds.
~15MB   50,000 lines = 8 seconds.
~30MB   100,000 lines = 16 seconds.
~60MB   200,000 lines  = 45 seconds.


It seems that when the CSV file crosses ~70MB in size things start going
haywire with performance, which leads to the out of memory condition.
The processing never ends.  Sending "kill -SIGUSR1" prints a stack
primarily composed of:

0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class

So it seems like it's trying to signal that it's out of memory after
it's out of memory, which triggers another OutOfMemory error.  So that's
why progress stops.


** Aside - OutOfMemory should probably be refactored to be able to
signal itself without taking up more memory, triggering itself
infinitely. Maybe it & its signalling morph infrastructure would be good
as a singleton **



I'm confused about why it runs out of memory.  According to htop the
image only takes up about 520-540 MB of RAM when it reaches the
'OutOfMemory' condition.  This Macbook Air laptop has 4GB, and has
plenty of room for the image to grow.  Also I've specified a 1,000MB
image size when starting. So it should have plenty of room.  Is there
something I should check or a flag somewhere that prevents it from
growing on a Mac?  This is the latest Pharo30 VM.


Thanks for helping me get to the bottom of this

Paul















Sven Van Caekenberghe-2 wrote

Hi Paul,

I think you must be doing something wrong with your class; the #do: is
implemented as streaming over the records one by one, never holding more
than one in memory.

This is what I tried:

'paul.csv' asFileReference writeStreamDo: [ :file|
ZnBufferedWriteStream on: file do: [ :out |
  (NeoCSVWriter on: out) in: [ :writer |
    writer writeHeader: { #Number. #Color. #Integer. #Boolean}.
    1 to: 1e7 do: [ :each |
      writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom. #(true false) atRandom } ] ] ] ].

This results in a 300MB file:

$ ls -lah paul.csv
-rw-r--r--@ 1 sven  staff   327M Nov 14 20:45 paul.csv
$ wc paul.csv
10000001 10000001 342781577 paul.csv

This is a selective read and collect (loads about 10K records):

Array streamContents: [ :out |
'paul.csv' asFileReference readStreamDo: [ :in |
  (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
    reader skipHeader; addIntegerField; addSymbolField; addIntegerField; addFieldConverter: [ :x | x = #true ].
    reader do: [ :each | each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].

This worked fine on my MacBook Air, no memory problems. It takes a while
to parse that much data, of course.

Sven

On 14 Nov 2014, at 19:08, Paul DeBruicker <pdebruic@> wrote:


Hi -

I'm processing 9 GB of CSV files (the biggest file is 220MB or so). I'm
not sure if it's because of the size of the files or the code I've
written to keep track of the domain objects I'm interested in, but I'm
getting out of memory errors & crashes in Pharo 3 on Mac with the latest
VM.  I haven't checked other VMs.

I'm going to profile my own code and attempt to split the files manually
for now to see what else it could be.


Right now I'm doing something similar to

        | file reader |
        file := '/path/to/file/myfile.csv' asFileReference readStream.
        reader := NeoCSVReader on: file.

        reader
                recordClass: MyClass;
                skipHeader;
                addField: #myField:;
                ....

        reader do: [ :eachRecord |
                self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord ].
        file close.



Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000
lines at a time) or an easy way to do that?
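
One way batching might be approached, as a minimal sketch rather than a built-in facility: it assumes the reader's stream-style #atEnd and #next are used to pull records one at a time. The processBatch block is a hypothetical stand-in for whatever should happen to each batch, and the batch size of 1000 is simply the figure from the question.

```smalltalk
| processBatch |
"Hypothetical handler: here it only prints the batch size."
processBatch := [ :records |
    Transcript cr; show: records size printString; flush ].
'/path/to/file/myfile.csv' asFileReference readStreamDo: [ :in |
    | reader batch |
    reader := NeoCSVReader on: (ZnBufferedReadStream on: in).
    reader skipHeader.
    [ reader atEnd ] whileFalse: [
        batch := OrderedCollection new.
        "Collect up to 1000 records, stopping early at end of file."
        [ reader atEnd or: [ batch size >= 1000 ] ]
            whileFalse: [ batch add: reader next ].
        processBatch value: batch ] ]
```

Batching like this mainly helps when there is per-batch work to do between reads; it does not by itself reduce the memory taken by the records that are kept.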




Thanks

Paul






--
View this message in context:
http://forum.world.st/running-out-of-memory-while-processing-a-220MB-csv-file-with-NeoCSVReader-tips-tp4790264p4790319.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.






