On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
> I can't reproduce the space leak here.  I tried Aleksander's original code,
> my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy
> bytestring version.
I unfortunately can't post the actual corpus here, because it's copyrighted.
But there are plenty of ways to retrieve large amounts of data from the
Internet.  See below.

> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $ I.foldl' (\t s
> -> T.insertWith (+) s 1 t) T.empty

Neat, folding the specialised function into the enumeratee is nifty!  One can
tell that my experience with iteratee/enumerator amounts to only a day's
worth so far :-\

> None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> Performance was pretty comparable for every version, although Aleksander's
> original did seem to have a very small edge.

How big were your input corpora?

Here's an absolutely evil shell script that is going to make me donate money
to Project Gutenberg.  It will gather a corpus that is still very small, but
at least realistic in its distributional properties.

+++ scrape.sh
#!/bin/sh

textfile=all_text.txt
touch $textfile

text=0
size=0
for i in `seq 10 300`; do
    wget -q "http://www.gutenberg.org/files/$i/$i.zip"
    if [ -f $i.zip ]; then
        unzip -qq $i.zip
        tr -sc '[:alpha:]' '\n' < $i.txt >> $textfile
        text=`dc -e "$text 1 + p"`
        size=`du -h $textfile | cut -f 1`
        rm -f $i.zip $i.txt
    fi
    echo -n "\rFound $text of $i texts, total $size."
done
echo "\rFound $text texts, total $size"
+++

It'll take a while to run, and the resulting corpus is just a fraction of
what I'm using, but it serves well to illustrate the problem.  If you want,
you can increase the amount of data gathered by tweaking the numerical range
in the seq call.  (If you make it gobble up more bandwidth, it might be
polite to put a sleep somewhere in the inner loop.  I'd host the resulting
file myself, but I don't know whether there are conditions on redistributing
it.)

If you did not tweak the script, it should have gathered some 100MB of data.
Running my unmodified original program on it, htop records 320MB of RAM
usage towards the end (*without* profiling enabled).  So it seems that I
can't get rid of a factor of around 3x the input file size.  Luckily, the
dependency at least seems to be linear.  Here's some profiling:

<<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s system 99% cpu 52.064 total

GHC itself reports 38MB average residency (which I can live with) and 140MB
maximum (too much).  Redirecting the program's output to a file yields a
mere 2.2MB for the data gathered by the above script.  Since those 2.2MB are
all I care about, why do I need so much more RAM to compute them?  Are my
demands unreasonable?

> I'd be happy to help track down a space leak in iteratee, but for now I'm
> not seeing one.

Thank you for your offer!  Maybe I'm just seeing ghosts, and there is no
space leak.  But I really do think that the program is eating too much RAM.

Regards,
Aleks
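P.S.  In case anyone wants to reproduce the comparison without any of my
code, here is a minimal, self-contained word-count sketch roughly along the
lines of the "lazy bytestring version" John mentions.  It is not my original
program, and the container choice (a strict Map keyed on strict ByteStrings,
standing in for my Wordcounts/T) is only an assumption so the example
compiles on its own.

+++ wordcount.hs
module Main (main) where

import qualified Data.ByteString.Char8      as S
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Map.Strict            as M
import           Data.List          (foldl')
import           System.Environment (getArgs)

-- Stand-in for my Wordcounts type: a strict map from word to frequency.
type Wordcounts = M.Map S.ByteString Int

-- Split the lazily read corpus into lines (scrape.sh leaves one word per
-- line), drop empty ones, and fold the rest into the map.  insertWith on a
-- strict Map keeps the counts evaluated, so ideally only the table itself
-- stays resident.
count :: L.ByteString -> Wordcounts
count = foldl' step M.empty . L.lines
  where
    step t s
      | L.null s  = t
      | otherwise = M.insertWith (+) (S.concat (L.toChunks s)) 1 t

main :: IO ()
main = do
  [file]   <- getArgs
  contents <- L.readFile file
  mapM_ (\(w, n) -> putStrLn (S.unpack w ++ " " ++ show n))
        (M.toList (count contents))
+++

Running it over the all_text.txt produced by scrape.sh and redirecting the
output should give a table of the same couple of megabytes, and its
residency can then be compared against the iteratee version with +RTS -s.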