Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-02 Thread Ketil Malde

Since frequency counts are an important use of map-like data structures,
I did a brief test of the available options.  First using regular
strings for input, and Data.Map.fromListWith - i.e. the operational bit being:

  freqs :: [String] -> M.Map String Int
  freqs = M.fromListWith (+) . map (,1)

This runs on a 14M corpus consisting of King James Bible, collected
works of Shakespeare, and War and Peace.

./freqstr1 +RTS -s 
   5,093,386,368 bytes allocated in the heap
   2,682,667,904 bytes copied during GC
 261,110,368 bytes maximum residency (20 sample(s))
   9,018,000 bytes maximum slop
 623 MB total memory in use (10 MB lost due to fragmentation)
./freqstr1 +RTS -s  21.43s user 0.78s system 99% cpu 22.285 total

Kinda expensive, 250MB to store word frequencies of 14MB text.

Now, changing to 

  freqs :: [String] -> M.Map String Int
  freqs = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty

i.e. using strict insertion, avoiding the buildup of lazy thunks for the
counts.

./freqstr2 +RTS -s  -- strings, using strict insertion
   4,754,110,096 bytes allocated in the heap
   2,089,527,240 bytes copied during GC
  27,039,112 bytes maximum residency (66 sample(s))
 613,192 bytes maximum slop
  80 MB total memory in use (2 MB lost due to fragmentation)
./freqstr2 +RTS -s  17.48s user 0.13s system 99% cpu 17.665 total

This reduced maximum memory consumption to one tenth: still bigger than the
input corpus, but clearly not too bad.  It's a bit faster, too, in spite of
probably doing more work.

Using ByteStrings instead, first fromListWith:

./freq +RTS -s
(Just 77432,113931)
   3,880,059,568 bytes allocated in the heap
   1,467,507,808 bytes copied during GC
 174,573,752 bytes maximum residency (14 sample(s))
   8,222,600 bytes maximum slop
 385 MB total memory in use (6 MB lost due to fragmentation)
./freq +RTS -s  14.26s user 0.49s system 99% cpu 14.798 total

About half the memory of the String version, and 25% faster.  With strict insertion:

./freq2 +RTS -s   -- map using strict insertion
   3,761,614,312 bytes allocated in the heap
 849,806,000 bytes copied during GC
  23,950,328 bytes maximum residency (35 sample(s))
   2,376,904 bytes maximum slop
  58 MB total memory in use (1 MB lost due to fragmentation)
./freq2 +RTS -s  11.14s user 0.13s system 99% cpu 11.295 total

Parallel to the String case, this is a lot more frugal with memory, and
30% faster.  Now, I tried Data.HashMap from the hashmap library:

./freqH1 +RTS -s   -- hashmap using fromListWith
   4,552,922,784 bytes allocated in the heap
   2,990,287,536 bytes copied during GC
 401,247,912 bytes maximum residency (14 sample(s))
  42,098,016 bytes maximum slop
 957 MB total memory in use (15 MB lost due to fragmentation)
./freqH1 +RTS -s  15.68s user 1.53s system 99% cpu 17.277 total

./freqH2 +RTS -s   -- hashmap using foldl' insertWith
   4,518,146,968 bytes allocated in the heap
   2,986,973,352 bytes copied during GC
 394,502,832 bytes maximum residency (14 sample(s))
  41,020,040 bytes maximum slop
 957 MB total memory in use (15 MB lost due to fragmentation)
./freqH2 +RTS -s  15.86s user 1.62s system 99% cpu 17.537 total

HashMap doesn't provide a strict insertWith, so this is similar to the
lazy insertions above.  A bit worse, actually, probably due to the
overhead of hashing.
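
A workaround, as a minimal sketch (the helper name is mine, and it assumes the
hashmap package's Data.HashMap exposes the usual lookup/insert), is to force
the new count yourself before it goes into the map:

  import qualified Data.ByteString.Char8 as S
  import qualified Data.HashMap as H   -- the 'hashmap' package

  -- Force the combined count before inserting it, so no chain of (+) thunks
  -- builds up behind a key even though the map itself is lazy in its values.
  bump :: H.HashMap S.ByteString Int -> S.ByteString -> H.HashMap S.ByteString Int
  bump m w = let n = maybe 1 (+1) (H.lookup w m)
             in n `seq` H.insert w n m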

Then, I discovered that Johan's hashmap is a different library, and
thought I'd try that too for completeness.

./freqHS +RTS -s  -- hashmap strict (unordered-containers)
   2,628,628,752 bytes allocated in the heap
 945,571,872 bytes copied during GC
  26,635,744 bytes maximum residency (32 sample(s))
   2,433,504 bytes maximum slop
  66 MB total memory in use (1 MB lost due to fragmentation)

./freqHS +RTS -s  6.90s user 0.16s system 99% cpu 7.082 total

Memory residency like the other strict versions, but really fast,
probably due to faster comparisons of hash values vs comparisons of
strings. 

Conclusion: make sure you are using a strict map, and if your keys are
strings or otherwise have expensive comparisons, unordered-containers is
the library for you.
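
For reference, the pattern that conclusion points at could look like this
(a sketch; it assumes Data.HashMap.Strict's insertWith, which evaluates the
stored values):

  import qualified Data.ByteString.Char8 as S
  import qualified Data.HashMap.Strict as HM
  import Data.List (foldl')

  -- Strict hash map plus strict insertion: counts are evaluated as they are
  -- accumulated, and key comparisons go through a hash first.
  freqs :: [S.ByteString] -> HM.HashMap S.ByteString Int
  freqs = foldl' (\m w -> HM.insertWith (+) w 1 m) HM.empty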

-k

PS: I also tried mapping 'copy' on the input words to avoid storing
large slices of the input, but it only worsened things:

./freqHS3 +RTS -s 
(Just 77432,113931)
   3,109,585,024 bytes allocated in the heap
 936,724,184 bytes copied during GC
  87,831,888 bytes maximum residency (19 sample(s))
   8,835,440 bytes maximum slop
 164 MB total memory in use (3 MB lost due to fragmentation)
./freqHS3 +RTS -s  12.71s user 0.31s system 99% cpu 13.060 total

Perhaps if you managed to only copy new words it would look better?

PPS: I tried to be careful juggling the results around, but there's
always the possibility of a mistake.  Caveat lector!  (Or should that be
'cave scriptor'?)

PPPS: There are some small interface annoyances around, it'd be nice 

Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread John Lato

 From: Brandon Moore brandon_m_mo...@yahoo.com


 I was worried data sharing might mean your keys
 retain entire 64K chunks of the input. However, it
 seems enumLines depends on the StringLike ByteString
 instance, which just converts to and from String.
 That can't be efficient, but I suppose it avoids excessive sharing.


That's true for 'enumLines', however the OP is using 'enumLinesBS', which
operates on bytestrings directly.

Data sharing certainly could be an issue here.  I tried performing
Data.ByteString.copy before inserting the key into the map, but that used
more memory.  I don't have an explanation for this; it's not what I would
expect.

The other parameter which affects sharing is the chunk size.  I got a much
better memory profile when using a chunksize of 1024 instead of  65536.
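
In the code from the original post that is just a change to the buffer size
handed to the driver (sketch; f' and prettyList are the definitions from
Aleksandar's original program):

  import System.Environment (getArgs)
  import Data.Iteratee.IO (fileDriverVBuf)

  -- Same driver as before, but with 1k chunks instead of 64k ones.
  -- (f' and prettyList as defined in the original post.)
  main :: IO ()
  main = getArgs >>= fileDriverVBuf 1024 f' . head >>= print . prettyList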

Oddly enough, when using the large chunksize I saw lower memory usage from
Data.Map, but with the small chunksize Data.HashMap has a significant
advantage.

John Lato


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread John Lato
On Wed, Jun 1, 2011 at 12:55 AM, Aleksandar Dimitrov 
aleks.dimit...@googlemail.com wrote:

 On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:

  None of these leak space for me (all compiled with ghc-7.0.3 -O2).
  Performance was pretty comparable for every version, although
 Aleksander's
  original did seem to have a very small edge.

 How big were your input corpora?


Today I was using multiple copies of War & Peace, as Brandon specified.
 Total size is about 90M.



 So it seems that I can't get rid of a factor of around 3x the input file
 size.
 Luckily, the dependency seems to be linear. Here's some profiling:

 <<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes
 residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT
 (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
 ../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s
 system 99% cpu 52.064 total

 ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.)

 Redirecting the program's output to a file will yield a mere 2.2M for the
 data
 gathered by the above script. Since those 2.2M of data are all I care
 about, why
 do I need so much more RAM to compute them?

 Are my demands unreasonable?


I think the issue is data sharing, as Brandon mentioned.  A bytestring
consists of an offset, length, and a pointer.  You're using a chunk size of
64k, which means the generated bytestrings all have a pointer to that 64k of
data.  Suppose there's one new word in that 64k, and it's near the beginning
of the chunk.  Even though the word may only be a few characters long, it'll
reference the entire chunk and keep it from being GC'd.
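
As a toy illustration of that sharing (a sketch; the sizes are made up, and
S.copy is bytestring's copy):

  import qualified Data.ByteString.Char8 as S

  -- 'word' is a 3-byte slice, but it shares chunk's buffer, so holding on to
  -- it keeps the whole (here 64k) chunk alive; 'copied' lives in its own
  -- 3-byte buffer and lets the chunk be GC'd.
  main :: IO ()
  main = do
      let chunk  = S.pack (replicate 65536 'x')
          word   = S.take 3 chunk
          copied = S.copy word
      print (S.length word, S.length copied)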

There are a few solutions to this.  The first is to make a copy of the
bytestring so only the required data is retained.  In my experiments this
wasn't helpful, but it would depend on your corpus.  The second is to start
with smaller chunks.  Using a chunk size of 1024 worked fairly well for me.
 If your corpus is similar to natural language, I think it'll probably work
better for you as well.

Note that Johan's Ngram code also only keeps the minimum required data,
giving it a good memory profile.   I didn't notice this last night because I
was testing with different data, and unfortunately the peculiar distribution
of that data masked this problem.

John L


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Daniel Fischer
On Wednesday 01 June 2011 12:13:54, John Lato wrote:
  From: Brandon Moore brandon_m_mo...@yahoo.com
  
  
  I was worried data sharing might mean your keys
  retain entire 64K chunks of the input. However, it
  seems enumLines depends on the StringLike ByteString
  instance, which just converts to and from String.
  That can't be efficient, but I suppose it avoids excessive sharing.
 
 That's true for 'enumLines', however the OP is using 'enumLinesBS',
 which operates on bytestrings directly.
 
 Data sharing certainly could be an issue here.  I tried performing
 Data.ByteString.copy before inserting the key into the map, but that
 used more memory.  I don't have an explanation for this; it's not what
 I would expect.

If you don't copy, the small ByteStrings (the words) point into the large 
chunk, which peacefully rests in memory. A few ripples come from the chunk 
boundaries, where more often than not a word will have parts in both 
chunks; a new ByteString is allocated for those.
A bit of space is wasted for the constructor of the large chunk, for the 
word-separators ('\n' here), and at the boundaries. Not a big deal: the 
newlines cost one byte per text-word, but the constructors cost five or so 
machine words per text-word.

If you copy each small ByteString from the chunk, a) the chunk stays in 
memory until it's completely copied [doesn't matter much with chunk sizes 
of a few k] and b) you get memory fragmentation because of alignment 
requirements.

 
 The other parameter which affects sharing is the chunk size.  I got a
 much better memory profile when using a chunksize of 1024 instead of 
 65536.
 
 Oddly enough, when using the large chunksize I saw lower memory usage
 from Data.Map, but with the small chunksize Data.HashMap has a
 significant advantage.
 
 John Lato



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Daniel Fischer
On Wednesday 01 June 2011 12:28:28, John Lato wrote:
 There are a few solutions to this.  The first is to make a copy of the
 bytestring so only the required data is retained.  In my experiments
 this wasn't helpful, but it would depend on your corpus.  The second is
 to start with smaller chunks.

The third: check whether the word is already known, and *make a copy if 
not*. That should keep only the required parts (including the currently 
processed chunk) in memory.
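
A sketch of that idea (the helper name is mine; it assumes lookup/adjust/insert
from unordered-containers' Data.HashMap.Strict):

  import qualified Data.ByteString.Char8 as S
  import qualified Data.HashMap.Strict as HM

  -- Known words are counted via the key already stored in the map; only a
  -- genuinely new word gets a compact copy, so no key keeps its input chunk
  -- alive.
  insertWord :: HM.HashMap S.ByteString Int -> S.ByteString -> HM.HashMap S.ByteString Int
  insertWord m w = case HM.lookup w m of
      Just _  -> HM.adjust (+1) w m
      Nothing -> HM.insert (S.copy w) 1 m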



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Edward Z. Yang
That sounds like a plausible reason why naive copying explodes space.
Something like string interning would be good here... and since you're
hashing already...

Edward

Excerpts from Daniel Fischer's message of Wed Jun 01 06:46:24 -0400 2011:
 On Wednesday 01 June 2011 12:28:28, John Lato wrote:
  There are a few solutions to this.  The first is to make a copy of the
  bytestring so only the required data is retained.  In my experiments
  this wasn't helpful, but it would depend on your corpus.  The second is
  to start with smaller chunks.
 
 The third, check whether the word is already known, and *make a copy if 
 not*. That should only keep the required parts (including the currently 
 processed chunk) in memory.
 



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Aleksandar Dimitrov
Hi John,
 I think the issue is data sharing, as Brandon mentioned.  A bytestring
 consists of an offset, length, and a pointer.  You're using a chunk size of
 64k, which means the generated bytestrings all have a pointer to that 64k of
 data.  Suppose there's one new word in that 64k, and it's near the beginning
 of the chunk.  Even though the word may only be a few characters long, it'll
 reference the entire chunk and keep it from being GC'd.

This seems to be the issue indeed! If the bytestrings on the hash map are
holding references to the chunks, it is clear that we're going to consume memory
scaling with the size of the input file, in case there are *any* new chunks
generated.

As I understand it, this was not a problem with the multiple copies of War and
Peace, because all byte strings were already found on the hash table! On reading
new input, old entries were found on the hash table, so only old chunks were
kept in memory, and the new ones could be GC'd.

In *realistic* data, however, the Long Tail is the reason that, after a while,
some chunks of input would only be kept alive because a few byte strings still
reference them. New words are rare, but they need occur no more often than once
per chunk to keep the whole chunk in memory, even though the chunk is mostly
useless (most other data in this chunk would already be on the hash map.)

 There are a few solutions to this.  The first is to make a copy of the
 bytestring so only the required data is retained.  In my experiments this
 wasn't helpful, but it would depend on your corpus.  The second is to start
 with smaller chunks.  Using a chunk size of 1024 worked fairly well for me.
  If your corpus is similar to natural language, I think it'll probably work
 better for you as well.

I think I solved this problem elegantly: I used Data.Text as hash map keys,
instead of Data.ByteString. Since decodeUtf8 copies the bytes into a freshly
allocated Text, no key retains the underlying input chunk. See the modified
program below:

 import qualified Data.Iteratee as I
 import Data.Iteratee.Char
 import Data.Iteratee.IO
 
 import qualified Data.HashMap.Strict as T
 
 import Data.Ord (comparing)
 import Data.List (sortBy)
 import System.Environment (getArgs)
 
 import qualified Data.ByteString as S
 import qualified Data.Text as X
 import Data.Text.Encoding (decodeUtf8)
 
 type Wordcounts = T.HashMap X.Text Int
 
 f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
 f' = I.joinI $ (enumLinesBS I.><> I.filter (not.S.null)) $ I.foldl'
        (\t s -> T.insertWith (+) (convert s) 1 t) T.empty
   where convert = decodeUtf8
 
 main :: IO ()
 main = getArgs >>= fileDriverVBuf 65536 f' . head
          >>= mapM_ (\(w,c) -> putStrLn $ X.unpack w ++ "\t" ++ show c) . sortM
   where sortM = sortBy (comparing snd) . T.toList

Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my
script yesterday report the following: htop says 120M memory residency towards
the end of the life-cycle.

<<ghc: 40928992256 bytes, 77984 GCs, 20939886/46665600 avg/max bytes residency 
(455 samples), 133M in use, 0.00 INIT (0.00 elapsed), 26.01 MUT (26.20 
elapsed), 32.96 GC (33.09 elapsed) :ghc>>

19MB avg, 44MB max residency, 133M in use (which is similar to what htop told
me) and the heap profile clearly shows allocation and deallocation of the
bytestrings: http://imgur.com/U1nyw (I'm attributing the ruggedness of the
profile with the frequent little spikes to the iteratee chunk allocation and
deallocation.) It seems I can't get rid of 50% of the heap being lag state. I
don't quite understand yet what that is. I also don't know what INHERENT_USE is.

But in any case, I now have a program that I can reasonably expect to run if I
could fit the input file into memory. I might try to implement an analogous
program in C++ or Java, just to see whether that would do better or similarly in
terms of memory consumption.

 Note that Johan's Ngram code also only keeps the minimum required data,
 giving it a good memory profile.   I didn't notice this last night because I
 was testing with different data, and unfortunately the peculiar distribution
 of that data masked this problem.

This is kind of the big problem here: whether or not you'll see the particular
behaviour I was so upset about seemed to depend on the corpus' distributional
properties.

In any case, I would like to thank you all for helping me understand and address
the issue. I probably still have a long way to go in terms of understanding
space-behaviour of Haskell programs, but at least I don't see ByteStrings as
this black box of efficiency anymore, but actually understand how they're
structured, and what they're good at, and what they aren't good at.

(take home lesson: Data.Text is really nice. Also: if iteratee has a space leak,
I probably didn't hit it, really. But: if reading byte-strings, one should mind
the pointers that byte-strings actually are.)

Regards,
Aleks



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Ketil Malde
Aleksandar Dimitrov aleks.dimit...@googlemail.com writes:

 Now, here's some observations: on a 75M input file (minuscule, compared to
 what I actually need) this program will eat 30M of heap space (says profiling)
 and return in 14 secs.

 I have two problems with that: a) that's too much heap space, b) the actual
 memory residency is *much* worse.

30M isn't a lot these days.  How does it scale?

 ad b) average memory residency is at 38MB (this is OK, given heap consumption)
 but max residency is at 130MB, which is unacceptable to me (remember that I
 need to run this on files *much* bigger than just 75M.)

I think max residency (depending on how you measure) can be twice the
heap size due to using copying GC.  If you run short of memory, the
runtime will switch to compacting GC which will be slower but use less
memory. 

 I have tried and tried again to avoid writing programs in Haskell that would
 leak space like BP likes to leak oil.

I know the feeling.  I think making a frequency table ought to be as
simple as 

   ... M.fromListWith (+) . map (,1) . words =<< readFile ...

or at worst

  ... foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty . words =<< readFile...

(which evaluates things strictly, and at least in my small tests, seems
to use quite a bit less heap space).
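
Spelled out as a complete program, the second variant might look like this
(a sketch; insertWith' as it existed in containers at the time, and assuming
one word per line as in the original post):

  import qualified Data.Map as M
  import Data.List (foldl')
  import System.Environment (getArgs)

  -- Strict insertion keeps the counts evaluated as the fold goes along.
  main :: IO ()
  main = do
      (f:_) <- getArgs
      ws <- fmap words (readFile f)
      print . M.toList $ foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty ws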

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread Johan Tibell
Hi Aleks,

On Wed, Jun 1, 2011 at 12:14 AM, Aleksandar Dimitrov
aleks.dimit...@googlemail.com wrote:
 I implemented your method, with these minimal changes (i.e. just using a main
 driver in the same file.)

 countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
 countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty

 main :: IO ()
 main = do (f:_) <- getArgs
           openFile f ReadMode >>= countUnigrams >>= print . M.toList

 It seems to perform about 3x worse than the iteratee method in terms of time,
 and worse in terms of space :-( On Brandon's War & Peace example, hGetLine
 uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s
 for the small file, and around 2 minutes for the large file.

That's curious. I chatted with Duncan Coutts today and he mentioned
that hGetLine can be a bit slow as it needs to take a lock in every
read and causes some copying, which could explain why it's slower than
iteratee which works in blocks. However, I don't understand why it
uses more memory. The ByteStrings that are returned by hGetLine should
have an underlying storage of the same size as the ByteString (as
reported by length). You can try to verify this by calling 'copy' on
the ByteString before inserting it.

It looks like hGetLine needs some love.
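
That check could look like this (a sketch; foldLines is the helper from
Johan's code elsewhere in this thread, and the copy is the only change):

  import qualified Data.ByteString as S
  import qualified Data.HashMap.Strict as M
  import System.IO (stdin)

  -- Copy each line before it becomes a key, so every key's underlying buffer
  -- is exactly as long as the key itself; if memory drops, the keys were
  -- retaining oversized buffers.
  countUnigramsCopy :: IO (M.HashMap S.ByteString Int)
  countUnigramsCopy = foldLines (\m s -> M.insertWith (+) (S.copy s) 1 m) M.empty stdin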

 I also tried sprinkling strictness annotations throughout your above code, 
 but I
 failed to produce good results :-(

The strictness of the code I gave should be correct. The problem
should be elsewhere.

 I, unfortunately, don't really have any contact to the elders, apart from 
 what
 I read on their respective blogs…

You and everyone else. :) I just spent enough time talking to people
on IRC, reading good code, blogs and mailing list posts. I think Bryan
described the process pretty well in his CUFP keynote:

http://www.serpentine.com/blog/2009/09/23/video-of-my-cufp-keynote/

Cheers,
Johan



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-06-01 Thread malcolm.wallace
Just out of interest, did you try reading the input as plain old Strings? They
may be unfashionable these days, and have their own type of badness in space
and time performance, but might perhaps be a win over ByteStrings for extremely
large datasets.

Regards,
Malcolm

[Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
Dear Cafe,

(Excuse the probably very ranty email; I am, unfortunately, at the end of my
wits, and I hope that as fellow programmers, you will understand that this is
among the most dreadful situations for our kind to be in.)

Say, we have an input file that contains a word per line. I want to find all
unigrams (unique words) in that file, and associate with them the amount of
times they occurred in the file. This would allow me, for example, to make a
list of word frequencies in a given text.

Simple enough task. Here's an implementation using iteratees (lazy IO is evil)
and unordered-containers' Data.HashMap.Strict, which enforces WHNF in values
and keys:

 import qualified Data.ByteString.Char8 as S
 import qualified Data.Iteratee as I
 import Data.Iteratee.IO
 
 import qualified Data.HashMap.Strict as T
 
 import Data.Iteratee.Char
 import Data.List (foldl')
 import Data.Char (toLower)
 
 import Data.Ord (comparing)
 import Data.List (sortBy)
 import System.Environment (getArgs)
 
 type Wordcounts = T.HashMap S.ByteString Int
 
 f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
 f' = I.joinI $ enumLinesBS (I.liftI $ step T.empty)
   where step t (I.Chunk str) = I.liftI (step $! foldl' maybeIncrement t str)
         step t stream        = I.idone t stream
         maybeIncrement t s
           | s == S.empty = t
           | otherwise    = {-# SCC "m-I-other" #-} T.insertWith (+) s 1 t
 
 main :: IO ()
 main = getArgs >>= fileDriverVBuf 65536 f' . head >>= print . prettyList
   where prettyList = -- sortBy (comparing snd) . T.toList
                      T.keys

Some lines are empty, and I don't want them to be recorded, so that's why
maybeIncrement is necessary.
hpaste of this code: http://hpaste.org/47300/spaceleak (ignore convert, that's
yet another issue.)

Now, here's some observations: on a 75M input file (minuscule, compared to what
I actually need) this program will eat 30M of heap space (says profiling) and
return in 14 secs.

I have two problems with that: a) that's too much heap space, b) the actual
memory residency is *much* worse.

ad b) average memory residency is at 38MB (this is OK, given heap consumption)
but max residency is at 130MB, which is unacceptable to me (remember that I need
to run this on files *much* bigger than just 75M.)

<<ghc: 9013546200 bytes, 16876 GCs, 38232524/130572448 avg/max bytes residency
(58 samples), 306M in use, 0.00 INIT (0.00 elapsed), 6.12 MUT (6.28 elapsed),
7.45 GC (7.50 elapsed) :ghc>>

In fact, htop reports total memory residency of the program at around 320 MB at
the end of its life-cycle (compiled and ran without profiling options.) I tried
running this program on a 1.4GB file, and had to kill it when at 3.5GB memory
consumption, my computer started paging. The actual hash-map, however, shouldn't
be much bigger than in the 75MB case (I expect about twice the size,) since the
input is natural language. I redirected the output of the program (showing a
list of assoc pairs that were in the hash map) to a file, and that file measured
11MB in the case of a 75MB corpus, and 40MB when I ran the program on a 1.4GB
corpus (I had to use a machine with 24GB of RAM to be able to run this.)

ad a) heap consumption is too high for two reasons: firstly, the actual data I
care about is much smaller than the data on the heap. Secondly, about half the
heap space is in LAG state. Here are profiles that will illustrate this:
http://imgur.com/wBWmJXN1mWltYNR.

- The first image shows 50% of the heap space being gobbled up with data that
  shouldn't be there anymore (LAG)
- The second image shows the types that are in LAG state: ByteString and
  HashMap.  So, it seems I'm keeping around hash maps?
- The third image blames the cost centre m-i-other for the LAG.
- Then there's the retainer profile: I have no idea how to read it. I thought it
  had something to do with iteratee reading in chunks of data, but after setting
  the chunk size to 4096 bytes and actually getting *less* spikes than before, I
  doubt that really is the case.

It seems like half of the data this program puts on the heap is not cleaned up
properly, and the otherwise-branch of maybeIncrement (i.e. the only really
important line in this program) is to blame. I'm thinking there's too little
strictness somewhere in the T.insertWith clause, but I have no idea how to make
it better.  Honestly, sprinkling around magic seq-dust every time something goes
wrong doesn't feel good. I cannot reason about Haskell's space-performance even
a little bit, and I have no idea how to change that.

Some more remarks: 
- I tried using lazy IO first, but that was a total disaster. Iteratee is nice,
  and works very well, but maybe I don't know how to use it?
- In terms of map-implementation, I observed that unordered-containers >
  Data.Map > bytestring-trie. It surprised me to see that Data.Map would perform
  better than bytestring-trie for this task (usually, Data.Map would be around
  twice as fast.)
- I should probably 

Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
On Tue, May 31, 2011 at 06:10:00PM +0200, Aleksandar Dimitrov wrote:
 ad a) heap consumption is too high for two reasons: firstly, the actual data I
 care about is much less than there's data on the heap. Secondly, about half 
 the
 heap space is in LAG state. Here are profiles that will illustrate this:
 http://imgur.com/wBWmJXN1mWltYNR.

Pardon me, I forgot to actually add the retainer profile:
- http://i.imgur.com/xLO13
- http://imgur.com/wBWmJXN1mWltYNRxLO13l?full is the full gallery.




Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread malcolm.wallace
ad a) heap consumption is too high for two reasons: firstly, the actual data I
care about is much less than there's data on the heap. Secondly, about half the
heap space is in LAG state. Here are profiles that will illustrate this:
http://imgur.com/wBWmJXN1mWltYNR.

- The first image shows 50% of the heap space being gobbled up with data that
  shouldn't be there anymore (LAG)
- The second image shows the types that are in LAG state: ByteString and
  HashMap.  So, it seems I'm keeping around hash maps?

In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created too
_early_. (Drag are those that are kept for longer than necessary.) Lots of Lag
generally means your program is too strict - it is forcing structure long
before it needs to. To fix it, you need to make things lazier. My first
suspicion would fall on ByteString.

Regards,
Malcolm


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
 In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created
 too _early_.  (Drag are those that are kept for longer than necessary.)  Lots
 of Lag generally means your program is too strict - it is forcing structure
 long before it needs to.  To fix it, you need to make things lazier.  My first
 suspicion would fall on ByteString.

Indeed, thank you, I mixed those up. I cannot use lazy byte strings here,
because of the way Data.Iteratee.Char's enumLinesBS works (it takes strict
byte strings.)

The only other strictness in there is ($!) and foldl'. The latter is necessary
for the program to even run (i.e. not run out of stack space.)

The strict application in step's argument seems necessary, since without it, the
program consumes 1200 MB of RAM (on my 75MB test data,) and takes very very 
long.
The hb profile indicates that a lot of data is allocated up front, and then
gradually eliminated. Interestingly, removing ($!) here seemed to *introduce*
unnecessary strictness. Here's the hb profile without ($!):
http://imgur.com/Ex7Pd I don't understand what is happening here :-\ I only just
started using iteratees.

Regards,
Aleks




Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Johan Tibell
Hi Aleksandar,

On Tue, May 31, 2011 at 6:10 PM, Aleksandar Dimitrov
aleks.dimit...@googlemail.com wrote:
 Say, we have an input file that contains a word per line. I want to find all
 unigrams (unique words) in that file, and associate with them the amount of
 times they occurred in the file. This would allow me, for example, to make a
 list of word frequencies in a given text.

Here's how I would do it:

{-# LANGUAGE BangPatterns #-}
module Ngram (countUnigrams) where

import qualified Data.ByteString as S
import qualified Data.HashMap.Strict as M
import System.IO

foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
    go !z = do
        eof <- hIsEOF h
        if eof
            then return z
            else do
                line <- S.hGetLine h
                go $ f z line
{-# INLINE foldLines #-}

-- Example use
countUnigrams :: IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty stdin

 RANT

 I have tried and tried again to avoid writing programs in Haskell that would
 leak space like BP likes to leak oil. However, I have yet to produce a single
 instance of a program that would do anything at all and at the same time 
 consume
 less memory than there is actual data in the input file.

 It is very disconcerting to me that I seem to be unable, even after quite some
 practice, to identify space leaks in trivial programs like the above. I know 
 of
 no good resource to educate myself in that regard. I have read the GHC manual,
 RWH's chapter on profiling, also Inside T5's recent series on the Haskell
 heap, but no dice. Even if I can clearly see the exact line where at least 
 some
 of the leaking happens (as I can in this case,) it seems impossible for me to
 prevent it.

 *thank you very much* for reading this far. This is probably a mostly useless
 email anyhow, I just had to get it off my chest. Maybe, just maybe, someone
 among you will have a crucial insight that will save Haskell for me :-) But
 currently, I see no justification to not start my next project in Lua, Python 
 or
 Java. Sure, Haskell's code is pretty, and it's fun, but if I can't actually
 *run* it, why bother?  (Yes, this isn't the first time I've ran into this
 problem …)

We definitely need more accessible material on how to reliably write
fast Haskell code. There are those among us who can, but it shouldn't
be necessary to learn it in the way they did (i.e. by lots of
tinkering, learning from the elders, etc). I'd like to write a 60 (or
so) page tutorial on the subject, but haven't found the time.

In addition to RWH, perhaps the slides from the talk on
high-performance Haskell I gave could be useful:


http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html

Cheers,
Johan



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Edward Z. Yang
Hello Aleksandar,

It is possible that the iteratees library is space leaking; I recall some
recent discussion to this effect.  Your example seems simple enough that
you might recompile with a version of iteratees that has -auto-all enabled.
Unfortunately, it's not really a safe bet to assume your libraries are
leak free, and if you've pinpointed it down to a single line, and there
doesn't seem to be a way to squash the leak, I'd bet it's the library's fault.

Edward



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Brandon Moore
I can't reproduce heap usage growing with the
size of the input file.

I made a word list from Project Gutenberg's
copy of War and Peace by

tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt

Using 1, 25, or 1000 repetitions of this ~3MB wordlist
shows about 100MB of address space used according
to top, and no more than 5MB or so of haskell heap
used according to the memory profile, with a flat
memory profile.


Is your memory usage growing with the size of the input
file, or the size of the histogram?

I was worried data sharing might mean your keys
retain entire 64K chunks of the input. However, it
seems enumLines depends on the StringLike ByteString
instance, which just converts to and from String.
That can't be efficient, but I suppose it avoids excessive sharing.

The other thing that occurs to me is that the total size of
your keys would also be approximately the size of the input
file if you were using plain text without each word split onto
a separate line.

Brandon



Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Brandon Moore
Wait, do ByteStrings show up on a heap profile, if the space is
allocated with malloc?

Anyway, I think my tests still show that the memory used by the
process doesn't grow simply by adding more data, if you
are no longer adding keys to the map.







Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
Hi Johan,

 Here's how I would do it:

I implemented your method, with these minimal changes (i.e. just using a main
driver in the same file.)

 countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
 countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty
 
 main :: IO ()
 main = do (f:_) <- getArgs
           openFile f ReadMode >>= countUnigrams >>= print . M.toList

It seems to perform about 3x worse than the iteratee method in terms of time,
and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses
1.565 seconds for the small file, whereas my iteratee method uses 1.085s for
the small file, and around 2 minutes for the large file.

For the large file, the code above starts consuming around 2.5GB of RAM,
so it clearly has a space leak somewhere. Where, I don't know.

If you want to try it out, here's a short command line to make a test corpus the
way Brandon made one:

+++

wget 'http://www.gutenberg.org/files/2600/2600.zip';
unzip 2600.zip;
touch wnp100.txt;
for i in {1..100}; do echo -n "$i "; cat 2600.txt >> wnp100.txt; done;
echo Done.

+++

Note, that, as I detailed in my prior email to Brandon, even if you do end up
with a (supposedly) non-leaking program for this example corpus, that doesn't
mean it'll scale well to real world data.

I also tried sprinkling strictness annotations throughout your above code, but I
failed to produce good results :-(

 We definitely need more accessible material on how to reliably write
 fast Haskell code. There are those among us who can, but it shouldn't
 be necessary to learn it in the way they did (i.e. by lots of
 tinkering, learning from the elders, etc). I'd like to write a 60 (or
 so) pages tutorial on the subject, but haven't found the time.

I'd be an eager reader :-) Please do announce it on -cafe or the usual places
should you ever come around to writing it!

I, unfortunately, don't really have any contact with the elders, apart from what
I read on their respective blogs…

 In addition to RWH, perhaps the slides from the talk on
 high-performance Haskell I gave could be useful:
 
 
 http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html

Thanks, I'll give it a look later tomorrow!

Regards,
Aleks

PS: Sorry I didn't answer you in #haskell, I ended up having to go afk for a
short while. Thanks for all your help!




Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread John Lato
From: Edward Z. Yang ezy...@mit.edu


 Hello Aleksandar,

 It is possible that the iteratees library is space leaking; I recall some
 recent discussion to this effect.  Your example seems simple enough that
 you might recompile with a version of iteratees that has -auto-all enabled.
 Unfortunately, it's not really a safe bet to assume your libraries are
 leak free, and if you've pinpointed it down to a single line, and there
 doesn't seem a way to squash the leak, I'd bet it's the library's fault.

 Edward


I can't reproduce the space leak here.  I tried Aleksander's original code,
my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy
bytestring version.

my iteratee version (only f' has changed from Aleksander's code):

f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
     I.foldl' (\t s -> T.insertWith (+) s 1 t) T.empty

my lazy bytestring version

 import Data.Iteratee.Char
 import Data.List (foldl')
 import Data.Char (toLower)

 import Data.Ord (comparing)
 import Data.List (sortBy)
 import System.Environment (getArgs)
 import qualified Data.ByteString.Lazy.Char8 as L
 import qualified Data.HashMap.Strict as T

 f'2 = foldl' (\t s -> T.insertWith (+) s 1 t) T.empty
       . filter (not . L.null) . L.lines

 main2 :: IO ()
 main2 = getArgs >>= L.readFile . head >>= print . T.keys . f'2

None of these leak space for me (all compiled with ghc-7.0.3 -O2).
Performance was pretty comparable for every version, although Aleksander's
original did seem to have a very small edge.

As someone already pointed out, keep in mind that this will use a lot of
memory anyway, unless there's a lot of repetition of words.

I'd be happy to help track down a space leak in iteratee, but for now I'm
not seeing one.

Best,
John Lato


Re: [Haskell-cafe] How on Earth Do You Reason about Space?

2011-05-31 Thread Aleksandar Dimitrov
On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
 I can't reproduce the space leak here.  I tried Aleksander's original code,
 my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy
 bytestring version.

I unfortunately can't post the actual corpus here, because it's copyrighted. But
there's plenty of ways to retrieve large amounts of data from the Internet. See
below.


 f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
 f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
      I.foldl' (\t s -> T.insertWith (+) s 1 t) T.empty

Neat, folding the specialised function into the enumeratee is nifty! One can
tell that my experience with iteratee/enumerator has been only a day's worth for
now :-\

 None of these leak space for me (all compiled with ghc-7.0.3 -O2).
 Performance was pretty comparable for every version, although Aleksander's
 original did seem to have a very small edge.

How big were your input corpora?

Here's an absolutely evil shell script that is going to make me donate money to
project Gutenberg. It will gather a corpus that is still very small, but at
least realistic in its distributional properties.

+++ scrape.sh

#!/bin/sh

textfile=all_text.txt

touch $textfile

text=0
size=0
for i in `seq 10 300`; do
    wget -q http://www.gutenberg.org/files/$i/$i.zip;
    if [ -f $i.zip ]; then
        unzip -qq $i.zip
        tr -sc '[[:alpha:]]' '\n' < $i.txt >> $textfile
        text=`dc -e "$text 1 + p"`
        size=`du -h $textfile | cut -f 1`
        rm -f $i.zip $i.txt
    fi
    echo -n "\rFound $text of $i texts, total $size."
done
echo "\rFound $text texts, total $size"

+++

It'll take a while to run, and the resulting corpus is just a fraction of what
I'm using, but it serves well to illustrate the problem. If you want, you can
increase the amount of data gathered by simply tweaking the numerical range in
the seq statement. (If you make it gobble up more bandwidth, it might be polite
to put a sleep somewhere in the inner loop. I'd host the resulting file myself,
but I don't know if there aren't any conditions on that.)

If you did not tweak the script, it should've gathered some 100MB of data.

Running my unmodified original program, htop records 320MB of RAM usage towards
the end (*without* profiling being enabled.)

So it seems that I can't get rid of a factor of around 3x the input file size.
Luckily, the dependency seems to be linear. Here's some profiling:

<<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes residency 
(189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT (24.94 
elapsed), 26.71 GC (27.10 elapsed) :ghc>>
../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s 
system 99% cpu 52.064 total

ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.)

Redirecting the program's output to a file will yield a mere 2.2M for the data
gathered by the above script. Since those 2.2M of data are all I care about, why
do I need so much more RAM to compute them?

Are my demands unreasonable?

 I'd be happy to help track down a space leak in iteratee, but for now I'm
 not seeing one.

Thank you for your offer! Maybe I'm just seeing ghosts, and there is no space
leak. But I really do think that the program is eating too much RAM.

Regards,
Aleks

