Hi Pierre,

I had a 45-50 MB file to parse with Ruby libraries, but to no avail:
the DOM-based libraries were painfully slow, and the SAX-based one
I tried (libxml-ruby) had some serious memory leaks. There is now
SAX Machine from Paul Dix, which looks usable:
http://www.pauldix.net/2009/01/sax-machine-sax-parsing-made-easy.html
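
To give a rough idea of what the streaming (SAX-style) approach looks
like, here is a minimal sketch. It uses REXML's stream listener rather
than SAX Machine or libxml-ruby, only because REXML ships with Ruby; the
ElementCounter class and the count_elements helper are made-up names for
illustration, not part of any library. The point is that the parser
calls back on each tag as it reads, so no document tree is ever built
in memory:

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: tallies how many times each element name occurs.
# Only the callbacks we care about are overridden; REXML::StreamListener
# supplies no-op defaults for the rest.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called once for every opening tag the parser encounters.
  def tag_start(name, _attributes)
    @counts[name] += 1
  end
end

# Hypothetical helper: streams the source (an IO or a String) through
# the listener and returns the tally.
def count_elements(source)
  listener = ElementCounter.new
  REXML::Document.parse_stream(source, listener)
  listener.counts
end
```

Usage would be something like
File.open("myfile.xml") { |f| count_elements(f) }, which hands the
parser an open file instead of reading the whole file into a string
first. I have not measured this against a 30+ MB file, so treat it as
an illustration of the technique, not a benchmark.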

As for my own problem, I ended up writing a StAX-based parser in Java
to get it to run in reasonable time :(

-
Maurício Linhares
http://codeshooter.wordpress.com/ | http://twitter.com/mauriciojr

On Thu, Jun 11, 2009 at 5:41 AM, PierreW<[email protected]> wrote:
>
> Hello,
>
> I need to parse two big XML files in a row (30+MB each). I have tried
> both REXML and Hpricot. They do work. Thing is, with both libraries,
> the parsing of each file takes a huge amount of memory: more than
> 700MB each!
>
> So I was wondering:
> - is it normal that parsing a 30MB file takes 700MB of memory? Could
> it be that something is wrong with the file? Is there an alternative
> way to deal with such big files?
> - is there a way to force the release of the memory when I don't need
> the file anymore? At the moment it is not released instantly after the
> first file, so I end up with 1.5GB memory use.
>
> I have reduced the code to the minimum to isolate the memory issue:
>
> xml = File.read("myfile.xml")
> doc = REXML::Document.new(xml)   # or: doc = Hpricot.XML(xml)
> doc = nil
>
> and repeat with the second file.
>
> Also, I tried libxml just in case. I get an error message that I
> can't explain:
> LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
> encoding !), yet the file is UTF-8 as far as I can tell.
>
> Thanks a lot for your help.
> Pierre
>
