Jason Dagit wrote:


On Mon, Apr 19, 2010 at 3:01 AM, Daniil Elovkov <daniil.elov...@googlemail.com> wrote:

    Hello haskellers!

    I'm trying to process an xml file with as little footprint as
    possible. SAX is alright for my case, and I think that's the
    lightest way possible. So, I'm looking at HaXml.SAX

    I'm surprised to see that it takes about 56-60 MB of RAM. This is
    roughly constant with respect to XML file size, which is expected; it
    only grows slightly as I recursively traverse the list of SAX events.
    But it still seems like too much.


For me these sorts of problems always involve investigation into the root cause. I'm just not good enough at predicting what is causing the memory consumption. Thankfully, GHC has great tools for this sort of investigative work. The book Real World Haskell documents how to use those tools:
http://book.realworldhaskell.org/read/profiling-and-optimization.html

If you haven't already, I highly recommend looking at the profiling graphs. See if you can figure out if your program has any space leaks.

No, I haven't profiled it yet. I just put together the simplest possible example (while evaluating a few XML parsers) and saw this result.

Given the triviality of the example, I thought it might simply be a "feature" of the HaXml SAX parser, and that somebody would confirm that this is indeed so.



    The size of the file is from 1MB to 20MB.

    The code is something like this

    import System.Environment (getArgs)
    import System.IO
    import Text.XML.HaXml.SAX (saxParse)

    main = do
       (fn:_) <- getArgs
       h <- openFile fn ReadMode
       c <- hGetContents h
       -- proc consumes the list of SAX events; its definition is elided here
       let out = proc $ fst $ saxParse fn c
       putStrLn out
       getChar


For such a simple program you won't run into any problems with lazy IO, but as your program grows in complexity it will very likely come back to bite you. If you're not familiar with lazy IO, I'm referring to the hGetContents call. Some example problems:

....
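One well-known instance of the problem, as an illustrative sketch (not necessarily the examples elided above): closing the handle before the lazily-read contents have been forced.

```haskell
import System.IO

-- Pitfall: hGetContents leaves the handle semi-closed; once hClose runs,
-- the unread remainder of the stream is truncated, so the consumer sees
-- (typically) an empty string rather than the file's contents.
brokenLen :: FilePath -> IO Int
brokenLen fn = do
  h <- openFile fn ReadMode
  c <- hGetContents h
  hClose h            -- too early: nothing has been read yet
  return (length c)   -- forces c only now, after the stream was cut off

-- Fix: force the result while the handle is still open.
fixedLen :: FilePath -> IO Int
fixedLen fn = withFile fn ReadMode $ \h -> do
  c <- hGetContents h
  let n = length c
  n `seq` return n
```

The fix works because `seq` demands the length (and hence the whole stream) before `withFile` closes the handle.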

Thanks, I'm aware of that. Speaking of this, I've always felt a bit uncomfortable about lazy IO. I've just looked at the safe-lazy-io package now.

I think iteratees are slowly catching on as an alternative to lazy IO. Basically, the iteratee approach uses a left-fold style to stream the data and process it in chunks, with some exception handling built in. Unfortunately, I think it may also require a special SAX parser that is specifically geared towards iteratee use. Having an iteratee-based SAX parser would make processing large XML streams very convenient in Haskell. Hint, hint, if you want to write a library :) (Or, maybe it exists, I admit that I haven't checked.)
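A toy sketch of what "left fold style" means here, with hypothetical types (real libraries such as iteratee on Hackage differ in detail): the consumer either is finished or asks for the next chunk, and the enumerator that owns the handle feeds it.

```haskell
import System.IO

-- A consumer is either done, or wants the next chunk of input.
data Iter a = Done a | NeedChunk (String -> Iter a)

-- The "enumerator" drives an iteratee with chunks read from a handle.
-- All the IO lives here, so the handle can be closed deterministically,
-- and a Done result stops the reading early.
run :: Handle -> Iter a -> IO a
run _ (Done x)      = return x
run h (NeedChunk k) = do
  eof <- hIsEOF h
  if eof
    then run h (k "")        -- in this toy, an empty chunk signals EOF
    else do
      chunk <- hGetLine h    -- line-sized chunks, for simplicity
      run h (k chunk)

-- Example consumer: count the chunks (lines) fed to it.
countLines :: Int -> Iter Int
countLines n = NeedChunk $ \c ->
  if null c then Done n else countLines (n + 1)
```

An iteratee-based SAX parser would feed SAX events rather than Strings, but the shape is the same: the consumer never touches the handle, so no lazy IO is needed.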


Iteratees seem like a natural thing if we want to completely avoid unsafeInterleaveIO. But the presence of the latter is so good for modularity.

We can glue IO code and pure code of type String -> a together so seamlessly. With iteratees, as I understand it, pure functions of type String -> a would no longer be usable directly on an IO String result; the signature (and the code itself) would have to be restructured as a left fold.
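The contrast can be sketched like this (hypothetical helper names; the per-line fold gives the same word count only because words never span lines):

```haskell
import System.IO

countWords :: String -> Int
countWords = length . words

-- Lazy IO: a pure String -> Int glues directly onto an IO String.
lazyStyle :: FilePath -> IO Int
lazyStyle fn = fmap countWords (readFile fn)

-- Fold style: the same logic must be restated as a step over chunks;
-- the whole-input pure function is no longer usable as-is.
foldStyle :: FilePath -> IO Int
foldStyle fn = withFile fn ReadMode (go 0)
  where
    go acc h = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- hGetLine h
          let acc' = acc + countWords line
          acc' `seq` go acc' h
```

Both compute the same number, but only the first reuses the pure function unchanged; the second had to be rewritten around the chunking.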

Another (additional) approach would be to encapsulate unsafeInterleaveIO within some routine and not let it go out into the wild.

lazilyDoWithIO :: IO a -> (a -> b) -> IO b

It would use unsafeInterleaveIO internally but catch all IO errors within itself.

I wonder if this is a reasonable idea? Maybe it's already done?
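A sketch of that routine, under the assumption that the caller supplies a fallback value: without one, an IOException raised when the deferred thunk is finally forced has no pure result to turn into. (Errors inside a nested lazy read such as hGetContents would still escape this catch, so this only covers exceptions thrown by the action itself.)

```haskell
import Control.Exception (IOException, catch)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical sketch of the proposed combinator.  The action is deferred
-- with unsafeInterleaveIO; if it throws an IOException when its result is
-- finally demanded, the fallback value is used instead, so the error never
-- escapes into pure code.
lazilyDoWithIO :: a -> IO a -> (a -> b) -> IO b
lazilyDoWithIO fallback act f = do
  x <- unsafeInterleaveIO (act `catch` \e -> const (return fallback) (e :: IOException))
  return (f x)
```

For example, `lazilyDoWithIO "" (readFile fn) id` yields the empty string if the file turns out not to exist by the time the contents are demanded.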
So the topic is shifting...

--
Daniil Elovkov
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
