I don't mind setting nodes = nil before calling GC.start (I read some other threads, so I think I understand why I have to do that), but I do mind the speed hit, so if you think there is a way around that I would love to know more.
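For anyone following along, the workaround being discussed looks roughly like this. The `StubDocument`/`Node` classes below are invented stand-ins (not the real libxml-ruby API) so the sketch runs on its own; the point is only the nil-before-GC pattern:

```ruby
# Stand-ins for illustration only -- a real program would use
# LibXML::XML::Document#find and get back wrapped libxml nodes.
Node = Struct.new(:content)

class StubDocument
  def find(_xpath)
    [Node.new("a"), Node.new("b")]  # pretend XPath result set
  end
end

def extract(doc, xpath)
  nodes = doc.find(xpath)
  results = nodes.map(&:content)
  nodes = nil  # drop the reference BEFORE forcing GC; otherwise the
  GC.start     # wrapped node set is still reachable and won't be freed
  results
end

p extract(StubDocument.new, "//item")  # => ["a", "b"]
```

The speed hit Matt mentions comes from that `GC.start` call running a full collection on every extraction.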
My general calling pattern is:

1. Document#find_first to get the topmost element I am interested in.
2. top_level_element#find for each of its direct children.

When I find each child, I then recurse down and load that child's children. So yes, I am walking the entire tree, which will create a bunch of objects. Even when only grabbing the top-level element in my test program, I am still seeing a big spike in memory.

I looked at the XPath Object code, and it looks to me like this is the case I am going to match when trying to find the topmost element of interest:

    case XPATH_NODESET:
      rval = Data_Wrap_Struct(cXMLXPathObject,
                              ruby_xml_xpath_object_mark,
                              ruby_xml_xpath_object_free,
                              xpop);

I am not familiar with Data_Wrap_Struct (part of Ruby?) so I don't know if it could potentially create lots of objects.

I will look at the XMLReader tests to try to get a better feel for whether it will meet my needs. Thank you for the suggestion.

Matt Margolis

2008/8/16 Charlie Savage <[EMAIL PROTECTED]>

> Hi Matt,
>
>> I am running on OSX and RedHat. I am using the Node#find method with an
>> XPath expression for the currently desired node in the default namespace
>> of the document. The crashes stopped happening when I set my nodes
>> variable to nil before calling GC.start. The memory does not spike too
>> much if I call GC.start after every single Node#find, but since parsing
>> a single document into the required number of ruby objects necessitates
>> calling Node#find over a thousand times, GC.start is really slowing
>> things down.
>
> Right, that is what you have to do (nodes = nil before GC.start). In my
> view, this is a design flaw in Ruby's GC, but I didn't get very far when
> I asked about it on the Ruby core list. We can work around it, but I
> haven't had a chance to do it. If you're feeling like writing some C
> code, I can explain how I think the problem can be fixed so you avoid
> all the manual GCs.
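The recursive depth-first walk described above can be sketched in pure Ruby. `TreeNode` here is an invented stand-in; the real code would call Document#find_first once and then Node#find with an XPath at each level:

```ruby
# Stand-in tree node -- a real walk would hold wrapped libxml nodes,
# which is why the whole tree's worth of Ruby objects gets created.
TreeNode = Struct.new(:name, :children)

def walk(node, depth = 0, out = [])
  out << ("  " * depth) + node.name
  # Recurse into each direct child, mirroring "find children, then
  # load each child's children" from the message above.
  node.children.each { |child| walk(child, depth + 1, out) }
  out
end

root = TreeNode.new("report", [
  TreeNode.new("header", []),
  TreeNode.new("body", [TreeNode.new("row", [])])
])

puts walk(root)
```

Every node visited allocates at least one Ruby object, which is consistent with the memory spike Matt sees on a full-tree walk.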
>> From what I can tell, calling Node#find on such a large document is
>> causing Ruby to add extra object heaps, which increases my memory usage
>> in a way that the program does not recover from. This is unfortunate
>> since I want to run multiple processes per box, but each process is
>> using several hundred megabytes of RAM after parsing a few large
>> documents.
>
> Well, the bindings generally only wrap an object when you access it. So
> in theory, calling nodes = document.find should only add one Ruby object
> (the result object). The code used to wrap every returned object, but
> I'm pretty sure I changed it. To verify, the code is in the xpath_object
> class.
>
> Now if you then iterate over each returned node in the result, they will
> of course get wrapped (i.e., a Ruby object is created for each libxml
> node).
>
>> The SAX parser with empty callbacks can rip through the document in
>> about 17ms, which is very fast in my opinion. The speed problem arises
>> when I try to do anything in the callbacks. The nature of the program
>> and the structure of the XML requires me to do quite a few lookups in a
>> series of hashes to determine the type of the current node and the type
>> of each text element. When SAX parsing I have to hit the hashes more
>> often since I don't have as much context information available as I do
>> with a recursive depth-first document walk with the document parser
>> node objects. With the necessary code in the callbacks I was seeing
>> parse times around 400ms, which is about twice as slow as the
>> document-based approach.
>
> Oh, I see. So it's all in the lookups.
>
>> XMLReader looks very interesting from the API docs, but I am not sure
>> that I grok how to actually use it. I will keep searching for
>> resources, but if you know of any examples of usage out there I would
>> love to read some code.
>
> I think there are a couple of tests (libxml/test) that might help a bit.
> Can't say I'm super familiar with that code either.
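The per-callback hash-lookup cost described above can be sketched like this. `NODE_TYPES` and `CountingHandler` are invented for illustration; a real handler would implement libxml's SAX callback interface rather than this stub:

```ruby
# Hypothetical element-name -> type table, consulted on every SAX event.
NODE_TYPES = { "price" => :decimal, "qty" => :integer, "name" => :string }

class CountingHandler
  attr_reader :seen

  def initialize
    @seen = Hash.new(0)
  end

  # Mimics a SAX on_start_element callback: one hash lookup per element,
  # with no surrounding tree context to narrow the search.
  def start_element(name)
    type = NODE_TYPES.fetch(name, :unknown)
    @seen[type] += 1
  end
end

handler = CountingHandler.new
%w[name price price qty weight].each { |el| handler.start_element(el) }
p handler.seen
```

With a document walk, the parent node is already known when a child is visited, so fewer of these lookups are needed per element, which matches the ~17ms empty-callback vs. ~400ms real-callback gap Matt reports.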
> But look for Python examples perhaps, or .NET (libxml copied the API
> from .NET supposedly, based on reading the libxml site).
>
> Charlie
>
> _______________________________________________
> libxml-devel mailing list
> libxml-devel@rubyforge.org
> http://rubyforge.org/mailman/listinfo/libxml-devel
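Since concrete XMLReader examples were hard to find in this thread, here is a pure-Ruby sketch of the pull-style read loop that reader APIs (libxml's XML::Reader, .NET's XmlReader) share: you call read in a loop and inspect the current node. `StubReader` is an invented stand-in, not the real class, so check the actual libxml-ruby API before relying on any of these names:

```ruby
# Stand-in pull reader: streams pre-built [type, value] events instead of
# parsing real XML, but the calling pattern matches a reader API.
class StubReader
  TYPE_ELEMENT = 1  # node-type codes are illustrative, not libxml's
  TYPE_TEXT    = 3

  def initialize(events)
    @events = events
    @pos = -1
  end

  # Advance to the next node; returns false at end of input.
  def read
    @pos += 1
    @pos < @events.length
  end

  def node_type
    @events[@pos][0]
  end

  def value
    @events[@pos][1]
  end
end

reader = StubReader.new([[1, "item"], [3, "42"], [1, "item"], [3, "7"]])
texts = []
while reader.read
  texts << reader.value if reader.node_type == StubReader::TYPE_TEXT
end
p texts  # => ["42", "7"]
```

Unlike SAX, the loop pulls events on demand, so surrounding state (what element you are inside) is easy to track in plain local variables, which may reduce the hash lookups discussed earlier.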