On Jun 11, 9:41 am, PierreW <[email protected]> wrote:
> Hello,
>
> I need to parse two big XML files in a row (30+MB each). I have tried
> both REXML and Hpricot. They do work. Thing is, with both libraries,
> the parsing of each file takes a huge amount of memory: more than
> 700MB each!
>
> So I was wondering:
> - is it normal that parsing a 30MB file takes 700MB of memory? Could
> it be that something is wrong with the file? Is there an alternative
> way to deal with such big files?

DOM parsers can use a lot of memory with large files (10x the file size
or more). SAX parsers don't, because they never keep the whole document
in memory: they just fire events as they read through the input. REXML
has a SAX-style parser, and libxml has one too.
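
As a rough illustration of the event-driven style, here is a minimal
sketch using REXML's stream listener interface. ElementCounter and
count_elements are hypothetical names made up for this example; it
only counts start tags, so adapt the callbacks to whatever you actually
need to pull out of the file.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: counts how often each element name occurs.
# It never holds more than the running totals in memory.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called by REXML once for every opening tag it reads.
  def tag_start(name, _attributes)
    @counts[name] += 1
  end
end

# Hypothetical helper: accepts a String or an IO and returns the totals.
def count_elements(source)
  listener = ElementCounter.new
  REXML::Document.parse_stream(source, listener)
  listener.counts
end

# Usage with a file (passing the File object lets REXML read from the
# IO rather than from one big string you built yourself):
#   File.open("myfile.xml") { |f| p count_elements(f) }
```

How much memory this saves depends on what your callbacks keep around,
so it is worth measuring on the real files.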

Fred
> - is there a way to force the release of the memory when I don't need
> the file anymore? At the moment it is not released instantly after the
> first file, so I end up with 1.5GB memory use.
>
> I have reduced the code to the minimum to isolate the memory issue:
>
> xml = File.read("myfile.xml")
> doc = REXML::Document.new(xml)   # or: doc = Hpricot.XML(xml)
> doc = nil
>
> and repeat with the second file.
>
> Also, I tried libxml in case. I get an error message that I can't
> explain:
> LibXML::XML::Error (Fatal error: Input is not proper UTF-8, indicate
> encoding !), yet the file is UTF-8 as far as I can tell.
>
> Thanks a lot for your help.
> Pierre
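
On the "not proper UTF-8" error, one way to test the assumption that
the file really is UTF-8 is to let Ruby validate it line by line. This
is only a sketch: invalid_utf8_lines is a hypothetical helper written
for this example, and it says nothing about what libxml itself checks.

```ruby
# Hypothetical helper: returns the 1-based numbers of the lines whose
# bytes are not valid UTF-8. Reads line by line, so the whole file is
# never held in memory at once.
def invalid_utf8_lines(io)
  bad = []
  io.each_line.with_index(1) do |line, number|
    # Reinterpret the raw bytes as UTF-8 without converting them.
    line.force_encoding(Encoding::UTF_8)
    bad << number unless line.valid_encoding?
  end
  bad
end

# Usage with a file opened in binary mode:
#   File.open("myfile.xml", "rb") { |f| p invalid_utf8_lines(f) }
```

An empty result means every line is valid UTF-8. Any line numbers that
come back are a good place to look for stray bytes from another
encoding such as Latin-1.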
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby 
on Rails: Talk" group.
-~----------~----~----~----~------~----~------~--~---
