I don't mind setting nodes = nil before calling GC.start (I read some other threads, so I think I understand why I have to do that), but I do mind the speed hit, so if you think there is a way around that I would love to know more.
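For anyone following along, the workaround being discussed looks roughly like this. The `StubDocument`/`Node` classes below are invented stand-ins (not the real libxml-ruby API) so the sketch runs on its own; the point is only the nil-before-GC pattern:

```ruby
# Stand-ins for illustration only -- a real program would use
# LibXML::XML::Document#find and get back wrapped libxml nodes.
Node = Struct.new(:content)

class StubDocument
  def find(_xpath)
    [Node.new("a"), Node.new("b")]  # pretend XPath result set
  end
end

def extract(doc, xpath)
  nodes = doc.find(xpath)
  results = nodes.map(&:content)
  nodes = nil  # drop the reference BEFORE forcing GC; otherwise the
  GC.start     # wrapped node set is still reachable and won't be freed
  results
end

p extract(StubDocument.new, "//item")  # => ["a", "b"]
```

The speed hit Matt mentions comes from that `GC.start` call running a full collection on every extraction.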
My general calling pattern is:

1. Document#find_first to get the topmost element I am interested in.
2. top_level_element#find for each of its direct children.

When I find each child, I then recurse down and load that child's children. So yes, I am walking the entire tree, which will create a bunch of objects. Even when only grabbing the top-level element in my test program, I am still seeing a big spike in memory.

I looked at the XPath Object code, and it looks to me like this is the case I am going to match when trying to find the topmost element of interest:

    case XPATH_NODESET:
      rval = Data_Wrap_Struct(cXMLXPathObject,
                              ruby_xml_xpath_object_mark,
                              ruby_xml_xpath_object_free,
                              xpop);

I am not familiar with Data_Wrap_Struct (part of Ruby?) so I don't know if it could potentially create lots of objects.

I will look at the XMLReader tests to try to get a better feel for whether it will meet my needs. Thank you for the suggestion.

Matt Margolis

2008/8/16 Charlie Savage <[EMAIL PROTECTED]>

> Hi Matt,
>
>> I am running on OSX and RedHat. I am using the Node#find method with an
>> XPath expression for the currently desired node in the default namespace
>> of the document. The crashes stopped happening when I set my nodes
>> variable to nil before calling GC.start. The memory does not spike too
>> much if I call GC.start after every single Node#find, but since parsing
>> a single document into the required number of ruby objects necessitates
>> calling Node#find over a thousand times, GC.start is really slowing
>> things down.
>
> Right, that is what you have to do (nodes = nil before GC.start). In my
> view, this is a design flaw in Ruby's GC, but I didn't get very far when
> I asked about it on the Ruby core list. We can work around it, but I
> haven't had a chance to do it. If you're feeling like writing some C
> code, I can explain how I think the problem can be fixed so you avoid
> all the manual GCs.
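The recursive depth-first walk described above can be sketched in pure Ruby. `TreeNode` here is an invented stand-in; the real code would call Document#find_first once and then Node#find with an XPath at each level:

```ruby
# Stand-in tree node -- a real walk would hold wrapped libxml nodes,
# which is why the whole tree's worth of Ruby objects gets created.
TreeNode = Struct.new(:name, :children)

def walk(node, depth = 0, out = [])
  out << ("  " * depth) + node.name
  # Recurse into each direct child, mirroring "find children, then
  # load each child's children" from the message above.
  node.children.each { |child| walk(child, depth + 1, out) }
  out
end

root = TreeNode.new("report", [
  TreeNode.new("header", []),
  TreeNode.new("body", [TreeNode.new("row", [])])
])

puts walk(root)
```

Every node visited allocates at least one Ruby object, which is consistent with the memory spike Matt sees on a full-tree walk.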
>> From what I can tell, calling Node#find on such a large document is
>> causing Ruby to add extra object heaps, which increases my memory usage
>> in a way that the program does not recover from. This is unfortunate
>> since I want to run multiple processes per box, but each process is
>> using several hundred megabytes of RAM after parsing a few large
>> documents.
>
> Well, the bindings generally only wrap an object when you access it. So
> in theory, calling nodes = document.find should only add one Ruby object
> (the result object). The code used to wrap every returned object, but
> I'm pretty sure I changed it. To verify, the code is in the xpath_object
> class.
>
> Now if you then iterate over each returned node in the result, they will
> of course get wrapped (i.e., a Ruby object is created for each libxml
> node).
>
>> The SAX parser with empty callbacks can rip through the document in
>> about 17ms, which is very fast in my opinion. The speed problem arises
>> when I try to do anything in the callbacks. The nature of the program
>> and the structure of the XML requires me to do quite a few lookups in a
>> series of hashes to determine the type of the current node and the type
>> of each text element. When SAX parsing I have to hit the hashes more
>> often since I don't have as much context information available as I do
>> with a recursive depth-first document walk with the document parser
>> node objects. With the necessary code in the callbacks I was seeing
>> parse times around 400ms, which is about twice as slow as the
>> document-based approach.
>
> Oh, I see. So it's all in the lookups.
>
>> XMLReader looks very interesting from the API docs, but I am not sure
>> that I grok how to actually use it. I will keep searching for
>> resources, but if you know of any examples of usage out there I would
>> love to read some code.
>
> I think there are a couple of tests (libxml/test) that might help a bit.
> Can't say I'm super familiar with that code either.
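The per-callback hash-lookup cost described above can be sketched like this. `NODE_TYPES` and `CountingHandler` are invented for illustration; a real handler would implement libxml's SAX callback interface rather than this stub:

```ruby
# Hypothetical element-name -> type table, consulted on every SAX event.
NODE_TYPES = { "price" => :decimal, "qty" => :integer, "name" => :string }

class CountingHandler
  attr_reader :seen

  def initialize
    @seen = Hash.new(0)
  end

  # Mimics a SAX on_start_element callback: one hash lookup per element,
  # with no surrounding tree context to narrow the search.
  def start_element(name)
    type = NODE_TYPES.fetch(name, :unknown)
    @seen[type] += 1
  end
end

handler = CountingHandler.new
%w[name price price qty weight].each { |el| handler.start_element(el) }
p handler.seen
```

With a document walk, the parent node is already known when a child is visited, so fewer of these lookups are needed per element, which matches the ~17ms empty-callback vs. ~400ms real-callback gap Matt reports.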
> But look for Python examples perhaps, or .NET (libxml copied the API
> from .NET supposedly, based on reading the libxml site).
>
> Charlie
>
> _______________________________________________
> libxml-devel mailing list
> libxml-devel@rubyforge.org
> http://rubyforge.org/mailman/listinfo/libxml-devel
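Since concrete XMLReader examples were hard to find in this thread, here is a pure-Ruby sketch of the pull-style read loop that reader APIs (libxml's XML::Reader, .NET's XmlReader) share: you call read in a loop and inspect the current node. `StubReader` is an invented stand-in, not the real class, so check the actual libxml-ruby API before relying on any of these names:

```ruby
# Stand-in pull reader: streams pre-built [type, value] events instead of
# parsing real XML, but the calling pattern matches a reader API.
class StubReader
  TYPE_ELEMENT = 1  # node-type codes are illustrative, not libxml's
  TYPE_TEXT    = 3

  def initialize(events)
    @events = events
    @pos = -1
  end

  # Advance to the next node; returns false at end of input.
  def read
    @pos += 1
    @pos < @events.length
  end

  def node_type
    @events[@pos][0]
  end

  def value
    @events[@pos][1]
  end
end

reader = StubReader.new([[1, "item"], [3, "42"], [1, "item"], [3, "7"]])
texts = []
while reader.read
  texts << reader.value if reader.node_type == StubReader::TYPE_TEXT
end
p texts  # => ["42", "7"]
```

Unlike SAX, the loop pulls events on demand, so surrounding state (what element you are inside) is easy to track in plain local variables, which may reduce the hash lookups discussed earlier.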