> On 14 Nov 2014, at 23:14, Paul DeBruicker <[email protected]> wrote:
>
> Hi Sven
>
> Yes, like I said earlier, after your first email, I think it's not so much a
> problem with NeoCSV as with what I'm doing and an out of memory condition.
>
> Have you ever seen a stack after sending kill -SIGUSR1 that looks like this:
>
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> ....
>
>
> What does that mean?
I don't know, but I think that you are really out of memory.
BTW, I think that setting no flags is better, memory will expand maximally then.
I think the useful maximum is closer to 1GB than 2GB.
> Answers to your questions below.
It is difficult to follow what you are doing exactly, but I think that you
underestimate how much memory a parsed, structured/nested object uses. Taking
the second line of your example, the 20+ fields, with 3 DateAndTimes, easily
cost between 512 and 1024 bytes per record. That would limit you to between 1M
and 2M records.
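
As a rough check of that limit, here is the arithmetic as a playground snippet.
It assumes a usable heap of about 1 GB (the figure suggested above) and the
guessed 512 to 1024 bytes per record; both are estimates, not measurements.

```smalltalk
"Back-of-the-envelope capacity estimate. heapBytes is an assumed figure."
| heapBytes atLargeRecords atSmallRecords |
heapBytes := 1024 * 1024 * 1024.
atLargeRecords := heapBytes // 1024.  "1048576, about 1M records"
atSmallRecords := heapBytes // 512.   "2097152, about 2M records"
{ atLargeRecords. atSmallRecords }
```
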
I tried this:
Array streamContents: [ :data |
  5e2 timesRepeat: [
    data nextPut: (Array streamContents: [ :out |
      20 timesRepeat: [ out nextPut: Character alphabet ].
      3 timesRepeat: [ out nextPut: DateAndTime now ] ]) ] ].
It worked up to 5e5, but not for 5e6. I didn't try numbers in between, as it
takes very long.
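
To replace the per-record guess with a measurement, one sample record can be
sized directly. This is only a sketch: it assumes the objects in your image
respond to #sizeInMemory (check that your Pharo version has it), and it counts
only the record Array and its direct elements. Objects held inside each
DateAndTime are not included, so treat the result as a lower bound.

```smalltalk
"Shallow size of one sample record, in bytes (a lower bound).
 Assumes Object>>sizeInMemory is available in this image."
| record bytes |
record := Array streamContents: [ :out |
  20 timesRepeat: [ out nextPut: Character alphabet ].
  3 timesRepeat: [ out nextPut: DateAndTime now ] ].
bytes := record sizeInMemory
  + (record inject: 0 into: [ :sum :each | sum + each sizeInMemory ]).
bytes
```

Multiplying the result by the expected number of records gives a first
estimate of the heap needed.
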
Good luck, if you can solve this, please tell us how you did it.
> Thanks again for helping me out
>
>
>
> Sven Van Caekenberghe-2 wrote
>> OK then, you *can* read/process 300MB .csv files ;-)
>>
>> What does your CSV file look like, can you show a couple of lines ?
>>
>> here are 2 lines + a header:
>>
>> "provnum","Provname","address","city","state","zip","survey_date_output","SurveyType","defpref","tag","tag_desc","scope","defstat","statdate","cycle","standard","complaint","filedate"
>> "015009","BURNS NURSING HOME, INC.","701 MONROE STREET
>> NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0314","Give
>> residents proper treatment to prevent new bed (pressure) sores or heal
>> existing bed sores.","D","Deficient, Provider has date of
>> correction","2013-10-10",1,"Y","N","2014-01-01"
>> "015009","BURNS NURSING HOME, INC.","701 MONROE STREET
>> NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0315","Ensure
>> that each resident who enters the nursing home without a catheter is not
>> given a catheter, unless medically necessary, and that incontinent
>> patients receive proper services to prevent urinary tract infections and
>> restore normal bladder functions.","D","Deficient, Provider has date of
>> correction","2013-10-10",1,"Y","N","2014-01-01"
>>
>>
>> You are using a custom record class of your own, what does that look like
>> or do ?
>>
>> A custom record class. This is all publicly available data but I'm
>> keeping track of the performance of US based health care providers during
>> their annual inspections. So the records are notes of a deficiency during
>> the inspection and I'm keeping those notes in a collection in an instance
>> of the health care provider's class. The custom record class just
>> converts the CSV record to objects (Integers, Strings, DateAndTime) and
>> then gets stuffed in the health care provider's deficiency history
>> OrderedCollection (which has about 100 items). Again I don't think it's
>> what I'm doing as much as the image isn't growing when it needs to.
>>
>>
>>
>>
>> Maybe you can try using Array again ?
>>
>> I've attempted to do it where I parse and convert the entire CSV into
>> domain objects then add them to the image and the parsing works fine, but
>> the system runs out of resources during the update phase.
>>
>>
>> What percentage of records read do you keep ? In my example it was very
>> small. Have you tried calculating your memory usage ?
>>
>>
>> I'm keeping some data from every record, but it doesn't load more than
>> 500MB of the data before falling over. I am not attempting to load the
>> 9GB of CSV files into one image. For 95% of the records in the CSV file
>> 20 of the 22 columns of the data are the same from file to file, just a
>> 'published date' and a 'time to expiration' date changes. Each file
>> covers a month, with about 500k deficiencies. Each month some
>> deficiencies are added to the file and some are resolved. So the total
>> number of deficiencies in the image is about 500k. Of those records that
>> don't expire in a given month I'm adding the published date to a
>> collection of published dates for the record and also adding the "time to
>> expiration" to a collection of those to record what was made public and
>> letting the rest of the data get GC'd. I don't only load those two
>> records because the other fields of the record in the CSV could change.
>>
>> I have not calculated the memory usage for the collection because I
>> thought it would have no problem fitting in the 2GB of RAM I have on this
>> machine.
>>
>>
>>
>>> On 14 Nov 2014, at 22:34, Paul DeBruicker <pdebruic@> wrote:
>>>
>>> Yes. With the image & vm I'm having trouble with I get an array with 9,942
>>> elements in it. So it works as you'd expect.
>>>
>>> While processing the CSV file the image stays at about 60MB in RAM.
>>>
>>> Sven Van Caekenberghe-2 wrote
>>>> Can you successfully run my example code ?
>>>>
>>>>> On 14 Nov 2014, at 22:03, Paul DeBruicker <pdebruic@> wrote:
>>>>>
>>>>> Hi Sven,
>>>>>
>>>>> Thanks for taking a look and testing the NeoCSVReader portion for me.
>>>>> You're right of course that there's something I'm doing that's slow.
>>>>> But.
>>>>> There is something I can't figure out yet.
>>>>>
>>>>> To provide a little more detail:
>>>>>
>>>>> When the 'csv reading' process completes successfully, profiling shows that
>>>>> most of the time is spent in NeoCSVReader>>#peekChar and in using
>>>>> NeoCSVReader>>#addField: to convert a string to a DateAndTime. Dropping
>>>>> the DateAndTime conversion speeds things up but doesn't stop it from
>>>>> running out of memory.
>>>>>
>>>>> I start the image with
>>>>>
>>>>> ./pharo-ui --memory 1000m myimage.image
>>>>>
>>>>> Splitting the CSV file helps:
>>>>> ~1.5MB 5,000 lines = 1.2 seconds.
>>>>> ~15MB 50,000 lines = 8 seconds.
>>>>> ~30MB 100,000 lines = 16 seconds.
>>>>> ~60MB 200,000 lines = 45 seconds.
>>>>>
>>>>>
>>>>> It seems that when the CSV file crosses ~70MB in size, performance starts
>>>>> going haywire, which leads to the out of memory condition. The
>>>>> processing never ends. Sending "kill -SIGUSR1" prints a stack primarily
>>>>> composed of:
>>>>>
>>>>> 0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>>>>
>>>>> So it seems like it's trying to signal that it's out of memory after it's
>>>>> out of memory, which triggers another OutOfMemory error. So that's why
>>>>> progress stops.
>>>>>
>>>>>
>>>>> ** Aside - OutOfMemory should probably be refactored to be able to
>>>>> signal
>>>>> itself without taking up more memory, triggering itself infinitely.
>>>>> Maybe
>>>>> it & its signalling morph infrastructure would be good as a singleton
>>>>> **
>>>>>
>>>>>
>>>>>
>>>>> I'm confused about why it runs out of memory. According to htop the
>>>>> image
>>>>> only takes up about 520-540 MB of RAM when it reaches the 'OutOfMemory'
>>>>> condition. This Macbook Air laptop has 4GB, and has plenty of room for
>>>>> the
>>>>> image to grow. Also I've specified a 1,000MB image size when starting.
>>>>> So
>>>>> it should have plenty of room. Is there something I should check or a
>>>>> flag
>>>>> somewhere that prevents it from growing on a Mac? This is the latest
>>>>> Pharo30 VM.
>>>>>
>>>>>
>>>>> Thanks for helping me get to the bottom of this
>>>>>
>>>>> Paul
>>>>>
>>>>> Sven Van Caekenberghe-2 wrote
>>>>>> Hi Paul,
>>>>>>
>>>>>> I think you must be doing something wrong with your class, the #do: is
>>>>>> implemented as streaming over the record one by one, never holding
>>>>>> more
>>>>>> than one in memory.
>>>>>>
>>>>>> This is what I tried:
>>>>>>
>>>>>> 'paul.csv' asFileReference writeStreamDo: [ :file |
>>>>>>   ZnBufferedWriteStream on: file do: [ :out |
>>>>>>     (NeoCSVWriter on: out) in: [ :writer |
>>>>>>       writer writeHeader: { #Number. #Color. #Integer. #Boolean }.
>>>>>>       1 to: 1e7 do: [ :each |
>>>>>>         writer nextPut: {
>>>>>>           each.
>>>>>>           #(Red Green Blue) atRandom.
>>>>>>           1e6 atRandom.
>>>>>>           #(true false) atRandom } ] ] ] ].
>>>>>>
>>>>>> This results in a 300MB file:
>>>>>>
>>>>>> $ ls -lah paul.csv
>>>>>> -rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
>>>>>> $ wc paul.csv
>>>>>> 10000001 10000001 342781577 paul.csv
>>>>>>
>>>>>> This is a selective read and collect (loads about 10K records):
>>>>>>
>>>>>> Array streamContents: [ :out |
>>>>>>   'paul.csv' asFileReference readStreamDo: [ :in |
>>>>>>     (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
>>>>>>       reader skipHeader; addIntegerField; addSymbolField; addIntegerField;
>>>>>>         addFieldConverter: [ :x | x = #true ].
>>>>>>       reader do: [ :each |
>>>>>>         each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
>>>>>>
>>>>>> This worked fine on my MacBook Air, no memory problems. It takes a
>>>>>> while
>>>>>> to parse that much data, of course.
>>>>>>
>>>>>> Sven
>>>>>>
>>>>>>> On 14 Nov 2014, at 19:08, Paul DeBruicker <pdebruic@> wrote:
>>>>>>>
>>>>>>> Hi -
>>>>>>>
>>>>>>> I'm processing 9 GB of CSV files (the biggest file is 220MB or so).
>>>>>>> I'm not sure if it's because of the size of the files or the code I've
>>>>>>> written to keep track of the domain objects I'm interested in, but I'm
>>>>>>> getting out of memory errors & crashes in Pharo 3 on Mac with the latest
>>>>>>> VM. I haven't checked other VMs.
>>>>>>>
>>>>>>> I'm going to profile my own code and attempt to split the files
>>>>>>> manually
>>>>>>> for now to see what else it could be.
>>>>>>>
>>>>>>>
>>>>>>> Right now I'm doing something similar to
>>>>>>>
>>>>>>> | file reader |
>>>>>>> file := '/path/to/file/myfile.csv' asFileReference readStream.
>>>>>>> reader := NeoCSVReader on: file.
>>>>>>>
>>>>>>> reader
>>>>>>>   recordClass: MyClass;
>>>>>>>   skipHeader;
>>>>>>>   addField: #myField:;
>>>>>>>   ....
>>>>>>>
>>>>>>>
>>>>>>> reader do: [ :eachRecord |
>>>>>>>   self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord ].
>>>>>>> file close.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Is there a facility in NeoCSVReader to read a file in batches (e.g.
>>>>>>> 1000
>>>>>>> lines at a time) or an easy way to do that ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://forum.world.st/running-out-of-memory-while-processing-a-220MB-csv-file-with-NeoCSVReader-tips-tp4790264p4790319.html
>>>>> Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.