Hi Matt,

I am running on OS X and RedHat. I am using the Node#find method with an XPath expression for the currently desired node in the default namespace of the document. The crashes stopped happening when I set my nodes variable to nil before calling GC.start. The memory does not spike too much if I call GC.start after every single Node#find, but since parsing a single document into the required number of Ruby objects necessitates calling Node#find over a thousand times, GC.start is really slowing things down.

Right, that is what you have to do (nodes = nil before GC.start). In my view this is a design flaw in Ruby's GC, but I didn't get very far when I asked about it on the Ruby core list. We can work around it, but I haven't had a chance to do so. If you feel like writing some C code, I can explain how I think the problem can be fixed so you can avoid all the manual GCs.
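For reference, here is roughly the pattern I mean, as a minimal sketch; the file name, XPath expression, and handle helper are just placeholders:

    require 'xml/libxml'

    document = XML::Document.file('large.xml')   # placeholder file name

    nodes = document.find('//record')            # placeholder XPath
    nodes.each { |node| handle(node) }           # handle is a made-up helper

    # Clearing the reference to the result set *before* forcing a collection
    # is the workaround that stops the crashes.
    nodes = nil
    GC.start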

From what I can tell, calling Node#find on such a large document is causing Ruby to add extra object heaps, which increases my memory usage in a way the program never recovers from. This is unfortunate since I want to run multiple processes per box, but each process is using several hundred megabytes of RAM after parsing a few large documents.

Well, the bindings generally only wrap an object when you access it. So in theory, calling nodes = document.find should only add one Ruby object (the result object). The code used to wrap every returned object, but I'm pretty sure I changed that. To verify, look at the code in the xpath_object class.

Now if you then iterate over each returned node in the result, they will of course get wrapped (i.e., a Ruby object is created for each libxml node).
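For example (a rough illustration, with a made-up XPath):

    nodes = document.find('//item')   # only the result object is created here
    nodes.each do |node|              # each node gets its Ruby wrapper as it is yielded
      puts node.name
    end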

The SAX parser with empty callbacks can rip through the document in about 17ms, which is very fast in my opinion. The speed problem arises when I try to do anything in the callbacks. The nature of the program and the structure of the XML require me to do quite a few lookups in a series of hashes to determine the type of the current node and the type of each text element. When SAX parsing I have to hit the hashes more often, since I don't have as much context information available as I do with a recursive depth-first walk over the document parser's node objects. With the necessary code in the callbacks I was seeing parse times around 400ms, which is about twice as slow as the document-based approach.
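Roughly, the callbacks look something like this (a simplified sketch; the hash contents, element names, and convert helper are stand-ins, and the SaxParser setup may differ slightly between versions):

    require 'xml/libxml'

    class TypeLookupCallbacks
      include XML::SaxParser::Callbacks

      # Stand-in for the series of hashes mentioned above.
      NODE_TYPES = { 'price' => :decimal, 'qty' => :integer }

      def on_start_element(name, attributes)
        @current_type = NODE_TYPES[name]        # one hash hit per element
      end

      def on_characters(chars)
        # with so little context, the text type needs another lookup here
        convert(chars, @current_type) if @current_type
      end

      def on_end_element(name)
        @current_type = nil
      end

      def convert(text, type)                   # made-up conversion helper
        type == :integer ? text.to_i : text.to_f
      end
    end

    parser = XML::SaxParser.new
    parser.filename = 'large.xml'
    parser.callbacks = TypeLookupCallbacks.new
    parser.parse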

Oh, I see. So it's all in the lookups.

XMLReader looks very interesting from the API docs, but I am not sure that I grok how to actually use it. I will keep searching for resources, but if you know of any examples of usage out there I would love to read some code.

I think there are a couple of tests (libxml/test) that might help a bit. I can't say I'm super familiar with that code either, but look for Python examples perhaps, or .NET ones (libxml supposedly copied the API from .NET, based on reading the libxml site).
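Off the top of my head, the pull-style loop looks roughly like this (untested; method and constant names may vary by version):

    require 'xml/libxml'

    reader = XML::Reader.file('large.xml')

    # NOTE: older versions of #read return 1/0/-1 while newer ones return
    # true/false, so the loop condition may need adjusting.
    while reader.read
      case reader.node_type
      when XML::Reader::TYPE_ELEMENT
        puts "element: #{reader.name}"
      when XML::Reader::TYPE_TEXT
        puts "text: #{reader.value}"
      end
    end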

Charlie
