Jason Dagit wrote:


On Mon, Apr 19, 2010 at 3:01 AM, Daniil Elovkov <daniil.elov...@googlemail.com> wrote:

    Hello haskellers!

    I'm trying to process an xml file with as little footprint as
    possible. SAX is alright for my case, and I think that's the
    lightest way possible. So, I'm looking at HaXml.SAX

    I'm surprised to see that it takes about 56-60 MB of RAM. This is
    roughly constant with respect to XML file size, which is expected; it
    only grows slightly as I recursively traverse the list of SAX events.
    But it still seems like too much.


For me these sorts of problems always involve investigation into the root cause. I'm just not good enough at predicting what is causing the memory consumption. Thankfully, GHC has great tools for this sort of investigative work. The book Real World Haskell documents how to use those tools:
http://book.realworldhaskell.org/read/profiling-and-optimization.html

If you haven't already, I highly recommend looking at the profiling graphs. See if you can figure out if your program has any space leaks.

No, I haven't profiled it yet. I just put together the simplest possible example (while evaluating a few XML parsers) and saw this result.

Given the triviality of the example, I thought it might simply be a "feature" of the HaXml SAX parser, and that somebody would confirm that this is indeed so.



    The size of the file is from 1MB to 20MB.

    The code is something like this

    import System.Environment (getArgs)
    import System.IO
    import Text.XML.HaXml.SAX (saxParse)

    main = do
       (fn:_) <- getArgs
       h <- openFile fn ReadMode
       c <- hGetContents h
       -- proc consumes the list of SAX events; its definition is elided here
       let out = proc $ fst $ saxParse fn c
       putStrLn out
       getChar


For such a simple program you won't run into any problems with lazy IO, but as your program grows in complexity it will very likely come back to bite you. If you're not familiar with lazy IO, I'm referring to the hGetContents call. Some example problems:

....
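One well-known instance of the problem, as an illustrative sketch (not necessarily the examples elided above): closing the handle before the lazily-read contents have been forced.

```haskell
import System.IO

-- Pitfall: hGetContents leaves the handle semi-closed; once hClose runs,
-- the unread remainder of the stream is truncated, so the consumer sees
-- (typically) an empty string rather than the file's contents.
brokenLen :: FilePath -> IO Int
brokenLen fn = do
  h <- openFile fn ReadMode
  c <- hGetContents h
  hClose h            -- too early: nothing has been read yet
  return (length c)   -- forces c only now, after the stream was cut off

-- Fix: force the result while the handle is still open.
fixedLen :: FilePath -> IO Int
fixedLen fn = withFile fn ReadMode $ \h -> do
  c <- hGetContents h
  let n = length c
  n `seq` return n
```

The fix works because `seq` demands the length (and hence the whole stream) before `withFile` closes the handle.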

Thanks, I'm aware of that. Speaking of this, I've always felt a bit uncomfortable about lazy IO. I've just looked at the safe-lazy-io package now.

I think iteratees are slowly catching on as an alternative to lazy IO. Basically, the iteratee approach uses a left-fold style to stream the data and process it in chunks, with some exception handling built in. Unfortunately, I think it may also require a special SAX parser that is specifically geared towards iteratee use. Having an iteratee-based SAX parser would make processing large XML streams very convenient in Haskell. Hint, hint, if you want to write a library :) (Or, maybe it exists, I admit that I haven't checked.)
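A toy sketch of what "left fold style" means here, with hypothetical types (real libraries such as iteratee on Hackage differ in detail): the consumer either is finished or asks for the next chunk, and the enumerator that owns the handle feeds it.

```haskell
import System.IO

-- A consumer is either done, or wants the next chunk of input.
data Iter a = Done a | NeedChunk (String -> Iter a)

-- The "enumerator" drives an iteratee with chunks read from a handle.
-- All the IO lives here, so the handle can be closed deterministically,
-- and a Done result stops the reading early.
run :: Handle -> Iter a -> IO a
run _ (Done x)      = return x
run h (NeedChunk k) = do
  eof <- hIsEOF h
  if eof
    then run h (k "")        -- in this toy, an empty chunk signals EOF
    else do
      chunk <- hGetLine h    -- line-sized chunks, for simplicity
      run h (k chunk)

-- Example consumer: count the chunks (lines) fed to it.
countLines :: Int -> Iter Int
countLines n = NeedChunk $ \c ->
  if null c then Done n else countLines (n + 1)
```

An iteratee-based SAX parser would feed SAX events rather than Strings, but the shape is the same: the consumer never touches the handle, so no lazy IO is needed.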


Iteratees seem like a natural thing if we want to completely avoid unsafeInterleaveIO. But the presence of the latter is so good for modularity.

We can glue IO code and pure code of type String -> a together so seamlessly. With iteratees, as I understand it, pure functions of type String -> a would no longer be usable directly on an IO String result; the signature (and the code itself) would have to be restructured as a left fold.
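The contrast can be sketched like this (hypothetical helper names; the per-line fold gives the same word count only because words never span lines):

```haskell
import System.IO

countWords :: String -> Int
countWords = length . words

-- Lazy IO: a pure String -> Int glues directly onto an IO String.
lazyStyle :: FilePath -> IO Int
lazyStyle fn = fmap countWords (readFile fn)

-- Fold style: the same logic must be restated as a step over chunks;
-- the whole-input pure function is no longer usable as-is.
foldStyle :: FilePath -> IO Int
foldStyle fn = withFile fn ReadMode (go 0)
  where
    go acc h = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- hGetLine h
          let acc' = acc + countWords line
          acc' `seq` go acc' h
```

Both compute the same number, but only the first reuses the pure function unchanged; the second had to be rewritten around the chunking.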

Another (additional) approach would be to encapsulate unsafeInterleaveIO within some routine and not let it go out into the wild.

lazilyDoWithIO :: IO a -> (a -> b) -> IO b

It would use unsafeInterleaveIO internally but catch all IO errors within itself.

I wonder if this is a reasonable idea? Maybe it's already done?
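A sketch of that routine, under the assumption that the caller supplies a fallback value: without one, an IOException raised when the deferred thunk is finally forced has no pure result to turn into. (Errors inside a nested lazy read such as hGetContents would still escape this catch, so this only covers exceptions thrown by the action itself.)

```haskell
import Control.Exception (IOException, catch)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical sketch of the proposed combinator.  The action is deferred
-- with unsafeInterleaveIO; if it throws an IOException when its result is
-- finally demanded, the fallback value is used instead, so the error never
-- escapes into pure code.
lazilyDoWithIO :: a -> IO a -> (a -> b) -> IO b
lazilyDoWithIO fallback act f = do
  x <- unsafeInterleaveIO (act `catch` \e -> const (return fallback) (e :: IOException))
  return (f x)
```

For example, `lazilyDoWithIO "" (readFile fn) id` yields the empty string if the file turns out not to exist by the time the contents are demanded.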
So the topic is shifting...

--
Daniil Elovkov
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
