[Haskell-cafe] help optimizing memory usage for a program

Manlio Perillo Mon, 02 Mar 2009 07:31:03 -0800

Hi.

After some work I have managed to implement two simple programs that parse the Netflix Prize data set.

For details about the Netflix Prize, there was a post by Kenneth Hoste some time ago.


I have cabalized the program and made it available here:
http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz

The process-data-1 program parses the training data set, grouping ratings by movie.

The process-data-2 program, instead, groups ratings by user.

The structure of the two programs is similar: I create singleton IntMaps and then use foldl + union to merge them together.
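
For readers without the tarball at hand, the merging step can be sketched roughly as follows. This is a minimal illustration, not the code from the package: the Entry type and the names singletonFor and groupByMovie are hypothetical, plain lists stand in for UArr, and foldl' with unionWith (++) is assumed in place of foldl + union so that ratings sharing a key are concatenated.

```haskell
import Data.List (foldl')
import qualified Data.IntMap as IntMap
import Data.IntMap (IntMap)
import Data.Word (Word32, Word8)

-- Hypothetical flat input record: (movie id, user id, rating).
type Entry = (Int, Word32, Word8)

-- Lists stand in for the unboxed arrays (UArr) of the real program.
type Ratings = IntMap [(Word32, Word8)]

-- One singleton map per parsed entry, keyed by movie id.
singletonFor :: Entry -> Ratings
singletonFor (movie, user, rating) = IntMap.singleton movie [(user, rating)]

-- Merge the singletons with a strict left fold. unionWith (++) appends
-- the new ratings to those already stored under the same key.
groupByMovie :: [Entry] -> Ratings
groupByMovie = foldl' (IntMap.unionWith (++)) IntMap.empty . map singletonFor
```

Grouping by user instead would only change which field is used as the key.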



The data structure used is:

type Rating = Word32 :*: Word8 -- id, rating
type MovieRatings = IntMap (UArr Rating)

UArr is from the uvector package.

The training data set contains about 100 million ratings, so, in theory, the amount of memory required to store pairs of (id, rating) is about 480 MB.

The process-data-1 program parses the entire data set using about 1.4 GB of memory (a 3x increase).
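
The estimate can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each pair is stored unboxed as 4 + 1 bytes with no padding, and it ignores IntMap nodes and per-array headers. The names bytesPerRating, payloadBytes and toMiB are made up for the illustration.

```haskell
import Data.Word (Word32, Word8)
import Foreign.Storable (sizeOf)

-- Bytes for one (id, rating) pair, assuming unboxed storage without padding.
bytesPerRating :: Int
bytesPerRating = sizeOf (undefined :: Word32) + sizeOf (undefined :: Word8)

-- Raw payload in bytes for a given number of ratings.
payloadBytes :: Int -> Int
payloadBytes n = n * bytesPerRating

-- Convert a byte count to mebibytes (2^20 bytes).
toMiB :: Int -> Double
toMiB b = fromIntegral b / (1024 * 1024)
```

For 100 million ratings, payloadBytes gives 500,000,000 bytes, which is a little under 480 MiB.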


This is strange: the memory required should be proportional only to the number of ratings.

The culprit may be IntMap, or the garbage collector, which does not release the memory to the operating system (or, worse, does not deallocate all the temporary memory used).



The process-data-2 program has many more problems.
When I try to parse *only* 500 movie ratings, the memory usage is 471 MB, rising to 780 MB after all data has been fully evaluated (array concatenations).



I'm using ghc 6.8.2 on Debian Etch.

I would like to do some tests with recent GHC versions, but it may be a problem.




Thanks

Manlio Perillo