Charlie,
I am running on OS X and Red Hat.  I am using the Node#find method with an
XPath expression for the currently desired node in the default namespace of
the document.  The crashes stopped happening once I set my nodes variable to
nil before calling GC.start.  Memory does not spike too badly if I call
GC.start after every single Node#find, but since parsing a single document
into the required number of Ruby objects necessitates calling Node#find
over a thousand times, GC.start really slows things down.
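
For reference, this is roughly the pattern I have settled on; the file name,
element name, and namespace URI are placeholders for my real schema:

    require 'xml/libxml'

    doc = XML::Document.file('large_document.xml')

    # 'dn' is an arbitrary prefix I bind to the document's default namespace
    # so the XPath expression can address nodes in it.
    nodes = doc.find('//dn:record', 'dn:http://example.com/schema')
    nodes.each do |node|
      # build the corresponding Ruby objects from each node
    end

    # Clearing the reference to the XPath result before GC.start is what
    # stopped the crashes; doing this after every single find is what
    # slows everything down.
    nodes = nil
    GC.start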

From what I can tell, calling Node#find on such a large document causes
Ruby to add extra object heaps, which increases my memory usage in a way
the program never recovers from.  This is unfortunate since I want to run
multiple processes per box, but each process is using several hundred
megabytes of RAM after parsing a few large documents.

The SAX parser with empty callbacks can rip through the document in about
17ms, which is very fast in my opinion.  The speed problem arises when I
try to do anything in the callbacks.  The nature of the program and the
structure of the XML require me to do quite a few lookups in a series of
hashes to determine the type of the current node and the type of each text
element.  When SAX parsing I have to hit the hashes more often, since I
don't have as much context available as I do with a recursive depth-first
walk over the document parser's node objects.  With the necessary code in
the callbacks I was seeing parse times around 400ms, which is about twice
as slow as the document-based approach.
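
To give a better idea of what the callbacks are doing, here is a
stripped-down sketch; the element names and lookup tables are simplified
placeholders for my real ones:

    require 'xml/libxml'

    # Placeholder lookup tables standing in for the series of hashes I described.
    NODE_TYPES = { 'record' => :record, 'name' => :field, 'price' => :field }
    TEXT_TYPES = { 'name' => :string, 'price' => :decimal }

    class RecordCallbacks
      include XML::SaxParser::Callbacks

      def initialize
        @stack = []
      end

      def on_start_element(name, attributes)
        @stack.push(name)
        @node_type = NODE_TYPES[name]      # hash hit on every element
      end

      def on_characters(chars)
        # With SAX the only context I have is the element-name stack, so
        # classifying the text means another hash hit here.
        text_type = TEXT_TYPES[@stack.last]
        # ... build the Ruby objects from @node_type, text_type and chars ...
      end

      def on_end_element(name)
        @stack.pop
      end
    end

    parser = XML::SaxParser.new
    parser.filename = 'large_document.xml'
    parser.callbacks = RecordCallbacks.new
    parser.parse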

XMLReader looks very interesting from the API docs, but I am not sure I
grok how to actually use it.  I will keep searching for resources, but if
you know of any usage examples out there I would love to read some code.
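
Pieced together from the rdocs, my completely untested guess at the usage
is something like the following, but please correct me if I have it wrong:

    require 'xml/libxml'

    reader = XML::Reader.file('large_document.xml')  # placeholder file name
    while reader.read == 1  # read appears to return 1 while there is more to pull
      case reader.node_type
      when XML::Reader::TYPE_ELEMENT
        # reader.name is the current element; I would do my hash lookups here
      when XML::Reader::TYPE_TEXT
        # reader.value should be the text content
      end
    end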

Thank you,
Matt Margolis


2008/8/16 Charlie Savage <[EMAIL PROTECTED]>

> Hi Matt,
>
>  I am making the parsed ruby objects available to a Rails application and I
>> find that if I call GC.start when using the library with Rails that it takes
>> several seconds to garbage collect and sometimes crashes.  If I call
>> GC.start in the loop when the program is running as a standalone process
>> then GC.start returns in a few dozen milliseconds.
>>
>
> What platform are you using?  Can you run a debug version and get a stack
> trace so we can see what is going on?  Are you using XPath?  If so, make
> sure to free pointers to your XPath result objects and call GC.start before
> the associated documents get freed (see the rdocs for more info,
> document#find I think it is).
>
>  I wrote a SAX style parser using libxml-ruby that does not suffer from the
>> memory growth but it is about 30 times slower than the document based parser
>> so I am really trying to make the document based approach work.
>>
>
> Why do you suppose SAX is so much slower?  It should be a lot faster since
> it doesn't build an in-memory tree.
>
> Any chance the XMLReader would work for you?
>
> Charlie
>
_______________________________________________
libxml-devel mailing list
libxml-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/libxml-devel