manlio_perillo: > Hi. > > After some work I have managed to implement two simple programs that > parse the Netflix Prize data set. > > For details about the Netflix Prize, there was a post by Kenneth Hoste > some time ago. > > I have cabalized the program, and made available here: > http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz > > The process-data-1 program parse the training data set, grouping ratings > by movies. > The process-data-2 program, instead, groups ratings by users. > > > The structure of the two programs is similar: I create singletons IntMap > and then use foldl + union to merge them together. > > > The data structure used is: > > type Rating = Word32 :*: Word8 -- id, rating > type MovieRatings = IntMap (UArr Rating) > > UArr is from uvector package. > > > The training data set contains about 100 million ratings, so, in theory, > the amount of memory required to store pairs of (id, rating) is about > 480 MB. > > > The process-data-1 program parse the entire dataset using about 1.4 GB > of memory (3x increment). > > This is strange. > The memory required is proportional to the number of ratings. > It may be IntMap the culprit, or the garbage collector than does not > release the memory to the operating system (or, worse, does not > deallocate all used temporary memory). > > > The process-data-2 has much more problems. > When I try to parse *only* 500 movie ratings, the memory usage is > 471 MB; 780 MB after all data has been fully evaluated (array > concatenations). > > > I'm using ghc 6.8.2 on Debian Etch. > I would like to do some tests with recent GHC versions, but it may be a > problem. >
Seems like a useful benchmark of a number of things. Any chance you'll upload it to hackage? _______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
