Hi Pierre, I had a 45-50 MB file to parse using Ruby libraries, but to no avail: the DOM-based libraries were painfully slow, and the SAX-based one I tried (libxml-ruby) had some serious memory leaks. There is now SaxMachine from Paul Dix, which looks usable: http://www.pauldix.net/2009/01/sax-machine-sax-parsing-made-easy.html
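For reference, here is a minimal sketch of what stream (SAX-style) parsing looks like in Ruby. It uses REXML's stream parser rather than libxml-ruby or SaxMachine, simply because it ships with Ruby. The `ItemCounter` class, the `count_items` helper and the `item` element name are made up for illustration; your file will have its own element names. The idea is that the parser calls back for each tag and text node instead of building the whole tree, so memory should not grow with the size of the document.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Hypothetical listener: counts elements with a given name and
# collects the text found inside each of them.
class ItemCounter
  include REXML::StreamListener # supplies no-op defaults for all callbacks

  attr_reader :count, :texts

  def initialize(element_name)
    @element_name = element_name
    @count = 0
    @texts = []
    @inside = false
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    return unless name == @element_name
    @inside = true
    @count += 1
    @texts << String.new
  end

  # Called for every text node; keep only text inside the target element.
  def text(data)
    @texts.last << data if @inside
  end

  # Called for every closing tag.
  def tag_end(name)
    @inside = false if name == @element_name
  end
end

# Hypothetical helper: streams XML from any IO-like object.
def count_items(io, element_name)
  listener = ItemCounter.new(element_name)
  REXML::Document.parse_stream(io, listener)
  listener
end

# Tiny in-memory sample standing in for the real file.
sample = StringIO.new("<items><item>a</item><item>b</item></items>")
result = count_items(sample, "item")
puts result.count
```

With a real file you would pass an open file handle instead of reading everything into a string first, e.g. `File.open("myfile.xml") { |f| count_items(f, "item") }`. REXML is pure Ruby, so this addresses memory more than speed.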
As for my own problem, I ended up writing a StAX-based parser in Java to get it to run in a reasonable time :(

-
Maurício Linhares
http://codeshooter.wordpress.com/ | http://twitter.com/mauriciojr

On Thu, Jun 11, 2009 at 5:41 AM, PierreW<[email protected]> wrote:
>
> Hello,
>
> I need to parse two big XML files in a row (30+MB each). I have tried
> both REXML and Hpricot. They do work. Thing is, with both libraries,
> the parsing of each file takes a huge amount of memory: more than
> 700MB each!
>
> So I was wondering:
> - is it normal that parsing a 30MB file takes 700MB of memory? Could
> it be that something is wrong with the file? Is there an alternative
> way to deal with such big files?
> - is there a way to force the release of the memory when I don't need
> the file anymore? At the moment it is not released instantly after the
> first file, so I end up with 1.5GB memory use.
>
> I have reduced the code to the minimum to isolate the memory issue:
>
> xml = File.read("myfile.xml")
> doc = REXML::Document.new(xml) or doc = Hpricot.XML(xml)
> doc = nil
>
> and repeat with the second file.
>
> Also, I tried libxml in case. I get an error message that I can't
> explain:
> LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
> encoding !
> yet the file is UTF-8 as far as I can tell.
>
> Thanks a lot for your help.
> Pierre

