Dear Cafe,

(Excuse the probably very ranty email; I am, unfortunately, at my wits' end, and I hope that as fellow programmers you will understand that this is among the most dreadful situations for our kind to be in.)
Say we have an input file that contains one word per line. I want to find all unigrams (unique words) in that file and associate with them the number of times they occurred, which would let me, for example, compute word frequencies for a given text. Simple enough task. Here's an implementation using iteratees (lazy IO is evil) and unordered-containers' Data.HashMap.Strict, which forces both keys and values to WHNF:

> import qualified Data.ByteString.Char8 as S
> import qualified Data.Iteratee as I
> import Data.Iteratee.IO
>
> import qualified Data.HashMap.Strict as T
>
> import Data.Iteratee.Char
> import Data.List (foldl')
> import Data.Char (toLower)
>
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
>
> type Wordcounts = T.HashMap S.ByteString Int
>
> -- Fold every line of the input into a strict HashMap of counts.
> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ enumLinesBS (I.liftI $ step T.empty)
>     where step t (I.Chunk str) = I.liftI (step $! foldl' maybeIncrement t str)
>           step t stream        = I.idone t stream
>           -- skip empty lines, count everything else
>           maybeIncrement t s
>               | s == S.empty = t
>               | otherwise    = {-# SCC "m-I-other" #-} T.insertWith (+) s 1 t
>
> main :: IO ()
> main = getArgs >>= fileDriverVBuf 65536 f' . head >>= print . prettyList
>     where prettyList = -- sortBy (comparing snd) . T.toList
>                        T.keys

Some lines are empty, and I don't want them recorded; that's what the guard in maybeIncrement is for. hpaste of this code: http://hpaste.org/47300/spaceleak (ignore convert, that's yet another issue.)

Now, here are some observations: on a 75M input file (minuscule compared to what I actually need) this program eats 30M of heap space (says profiling) and returns in 14 secs. I have two problems with that: a) that's too much heap space, and b) the actual memory residency is *much* worse.

ad b) Average memory residency is 38MB (which is OK, given the heap consumption), but max residency is 130MB, which is unacceptable to me (remember that I need to run this on files *much* bigger than 75M):

<<ghc: 9013546200 bytes, 16876 GCs, 38232524/130572448 avg/max bytes residency (58 samples), 306M in use, 0.00 INIT (0.00 elapsed), 6.12 MUT (6.28 elapsed), 7.45 GC (7.50 elapsed) :ghc>>

In fact, htop reports the program's total memory residency at around 320MB at the end of its life-cycle (compiled and run without profiling options). I tried running it on a 1.4GB file and had to kill it when, at 3.5GB memory consumption, my computer started paging. The actual hash map, however, shouldn't be much bigger than in the 75MB case (I expect roughly twice the size), since the input is natural language. When I redirected the program's output (the list of assoc pairs in the hash map) to a file, that file measured 11MB for the 75MB corpus, and 40MB for the 1.4GB corpus (for the latter I had to use a machine with 24GB of RAM.)

ad a) Heap consumption is too high for two reasons: firstly, the actual data I care about is much smaller than what's on the heap; secondly, about half the heap space is in LAG state. Here are profiles that illustrate this: http://imgur.com/wBWmJ&XN1mW&YNR

- The first image shows 50% of the heap being gobbled up by data that shouldn't be there anymore (LAG).
- The second image shows the types in LAG state: ByteString and HashMap. So it seems I'm keeping old hash maps around?
- The third image blames the cost centre m-I-other for the LAG.
- Then there's the retainer profile: I have no idea how to read it.
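(A small aside that may matter when reading those profiles: to rule out the final map being returned as one long chain of thunks, one could deep-force it right before printing. forceMap below is a hypothetical helper, not part of the hpaste; since Data.HashMap.Strict is supposed to keep keys and values in WHNF already, and an Int in WHNF is fully evaluated, it should change nothing, which is exactly the assumption I'd want to check:)

> -- Hypothetical helper: walk the final map and force every key and
> -- value to WHNF before prettyList sees it. If the map really is as
> -- strict as advertised, this is a no-op.
> forceMap :: Wordcounts -> Wordcounts
> forceMap m = T.foldrWithKey (\k v r -> k `seq` v `seq` r) () m `seq` m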
I thought it had something to do with iteratee reading in chunks of data, but after setting the chunk size to 4096 bytes and actually getting *fewer* spikes than before, I doubt that's really the case. It seems like half of the data this program puts on the heap is not cleaned up properly, and the otherwise-branch of maybeIncrement (i.e. the only really important line in this program) is to blame. I'm thinking there's too little strictness somewhere in the T.insertWith clause, but I have no idea how to make it better (the first P.S. below sketches the kind of variation I mean.) Honestly, sprinkling magic seq-dust around every time something goes wrong doesn't feel good. I cannot reason about Haskell's space performance even a little bit, and I have no idea how to change that.

Some more remarks:

- I tried using lazy IO first, but that was a total disaster. Iteratee is nice and works very well, but maybe I don't know how to use it?
- Comparing map implementations by running time, I observed unordered-containers < Data.Map < bytestring-trie, i.e. unordered-containers was fastest. It surprised me to see Data.Map perform better than bytestring-trie for this task (usually it was around twice as fast.)
- I should probably look into using a mutable data structure as a map (the second P.S. below sketches what I have in mind.) I would find it a bit disenchanting to have to use a mutable data structure this way, but maybe that's just reality for you.

RANT

I have tried, and tried again, to avoid writing programs in Haskell that leak space like BP likes to leak oil. However, I have yet to produce a single program that does anything at all and at the same time consumes less memory than there is actual data in the input file. It is very disconcerting to me that I seem unable, even after quite some practice, to identify space leaks in trivial programs like the one above. I know of no good resource to educate myself in that regard. I have read the GHC manual, RWH's chapter on profiling, and "Inside T5"'s recent series on the Haskell heap, but no dice. Even when I can clearly see the exact line where at least some of the leaking happens (as I can in this case), it seems impossible for me to prevent it.

*thank you very much* for reading this far. This is probably a mostly useless email anyhow; I just had to get it off my chest. Maybe, just maybe, someone among you will have a crucial insight that will save Haskell for me :-) But currently I see no justification not to start my next project in Lua, Python or Java. Sure, Haskell code is pretty, and it's fun, but if I can't actually *run* it, why bother? (Yes, this isn't the first time I've run into this problem …)

Thanks, and kindest regards to you all,
Aleks
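P.S. For concreteness, here's the kind of "seq-dust" variation on maybeIncrement I mean (just a sketch; with Data.HashMap.Strict the extra seq ought to be redundant, which is part of why I'm confused):

> -- Sketch: look up the old count, force the new one explicitly, and
> -- insert it, instead of relying on insertWith (+) to do the forcing.
> maybeIncrement :: Wordcounts -> S.ByteString -> Wordcounts
> maybeIncrement t s
>     | S.null s  = t
>     | otherwise = let n = 1 + T.lookupDefault 0 s t
>                   in n `seq` T.insert s n t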
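P.P.S. And here's roughly what I mean by using a mutable map, sketched against the hashtables package's Data.HashTable.IO API (assuming that package; untested and unbenchmarked, and it reads the whole file with strict ByteString IO instead of iteratee, just to keep the sketch short):

> import qualified Data.ByteString.Char8 as S
> import qualified Data.HashTable.IO as H  -- from the hashtables package
>
> type Table = H.BasicHashTable S.ByteString Int
>
> main :: IO ()
> main = do
>     h  <- H.new :: IO Table
>     ws <- fmap S.lines S.getContents  -- strict read, fine for a sketch
>     mapM_ (bump h) [w | w <- ws, not (S.null w)]
>     H.toList h >>= print . length
>   where
>     -- force the new count before inserting it
>     bump h w = H.lookup h w >>= \mn -> H.insert h w $! maybe 1 (+1) mn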