Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-03-01 Thread Manlio Perillo

Kenneth Hoste ha scritto:

Hello,

I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.

I kind of have an algorithm in mind that I want to implement using Haskell,
but up until now, the main issue has been to find a way to efficiently 
represent

the data...

For people who are not familiar with the Netflix data, in short, it 
consist of
roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different 
movies, coming from

480,109 different users.



Hi Kenneth.

I have written a simple program that parses the Netflix training data 
set, using this data structure:


type MovieRatings = IntMap (UArr Word32, UArr Word8)

The ratings are grouped by movies.

The parsing is done in:
real8m32.476s
user3m5.276s
sys 0m8.681s

On a DELL Inspiron 6400 notebook,
Intel Core2 T7200 @ 2.00GHz, and 2 GB memory.


However the memory used is about 1.4 GB.
How did you manage to get 700 MB memory usage?


Note that the minimum space required is about 480 MB (assuming 4 byte 
integer for the ID, and 1 byte integer for rating).
Using a 4 byte integer for both ID and rating, the space required is 
about 765 MB.


1.5 GB is the space required if one uses a total of 16 bytes to store 
both the ID and the rating.




Maybe it is the garbage collector that does not release memory to the 
operating system?




Thanks  Manlio Perillo
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-26 Thread Manlio Perillo

Kenneth Hoste ha scritto:

Hello,

I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.

[...]
To see if I could efficiently represent the data set in this way, I 
wrote a small

Haskell program (attached) which uses the following data type:



From what I see, to append a new integer to the Array, you convert the 
array to a list, append the new element to the list, and then convert to 
array again.


Isn't this a bit inefficient?

The uvector package implements a vector of unboxed types, and has an 
snocU operation, to append an element to the array.


I don't know how efficient it is, however.


By the way, about uvector: it has a Stream data type, and you can build 
a vector from a stream.


But how this work and how (if any) the stream data is integrated with 
other packages?

The package documentations seems to be still incomplete.

 [...]



Regards  Manlio Perillo
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-26 Thread Kenneth Hoste


On Feb 26, 2009, at 13:00 , Manlio Perillo wrote:


Kenneth Hoste ha scritto:

Hello,
I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.
[...]
To see if I could efficiently represent the data set in this way, I  
wrote a small

Haskell program (attached) which uses the following data type:


From what I see, to append a new integer to the Array, you convert  
the array to a list, append the new element to the list, and then  
convert to array again.


Isn't this a bit inefficient?


Yes, performance-wise this is terribly inefficient, I agree. But, it  
was just an artefact of how the raw data is organized.


My main concern was the memory usage of the huge IntMap with UArray  
elements.
Once I solved that, I would be able to get around the performance  
issue by reorganizing the raw data.


However, as I posted yesterday, I've been able to circumvent the issue  
by rethinking my data type, i.e. using
the ~18K movie IDs as key instead of the 480K user IDs, which  
radically limits the overhead...
That way, I'm able to fit the data set in 700M of memory, without  
having to reorganize the raw data.


The uvector package implements a vector of unboxed types, and has an  
snocU operation, to append an element to the array.


I don't know how efficient it is, however.



By the way, about uvector: it has a Stream data type, and you can  
build a vector from a stream.


Thanks for letting me know, I'll keep this in mind.

greetings,

Kenneth

--

Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.ho...@elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-26 Thread Manlio Perillo

Kenneth Hoste ha scritto:

[...]
However, as I posted yesterday, I've been able to circumvent the issue 
by rethinking my data type, i.e. using
the ~18K movie IDs as key instead of the 480K user IDs, which radically 
limits the overhead...


Well, but what if you really need the original data structure, for 
better data processing?


That way, I'm able to fit the data set in 700M of memory, without 
having to reorganize the raw data.


The uvector package implements a vector of unboxed types, and has an 
snocU operation, to append an element to the array.


I don't know how efficient it is, however.



By the way, about uvector: it has a Stream data type, and you can 
build a vector from a stream.


Thanks for letting me know, I'll keep this in mind.



Let me know if there are performance improvements.

Arrays are one of the few things I dislike in Haskell, and all the 
available array/vector packages cause me some confusion.





Regards   Manlio
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-23 Thread Kenneth Hoste

Hello,

I'm having a go at the Netflix Prize using Haskell. Yes, I'm brave.

I kind of have an algorithm in mind that I want to implement using  
Haskell,
but up until now, the main issue has been to find a way to efficiently  
represent

the data...

For people who are not familiar with the Netflix data, in short, it  
consist of
roughly 100M (1e8) user ratings (1-5, integer) for 17,770 different  
movies, coming from

480,109 different users.

The idea I have in mind requires fast access to all the ratings for a  
particular user,

so I was thinking about an IntMap in terms of user ids, which maps to
movie id/rating pairs somehow.

To see if I could efficiently represent the data set in this way, I  
wrote a small

Haskell program (attached) which uses the following data type:


testMemSizeUArray_UArray_Word8.hs
Description: Binary data




type data = IntMap (UArray Int Word8) -- map of user IDs to ratings  
(lacks movie IDs)


For convenience, and because I've been discussing this with various  
people in #haskell @ IRC,
the code is also available here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=1671#a1671 
 .


I'm aware that the way in which the UArray's are constructed (i.e. by  
adding a single element each
time), is far from efficient performance-wise, but that's not the  
point here...
By reformatting the raw data, I can easily read in the data more  
efficiently.


The issue I ran into is that changing the data type to the following,  
doesn't yield any significant

different in memory usage.

type data = IntMap (UArray Int Int) -- use Int instead of Word8 for a  
user rating


Someone (dolio) @ #haskell suggested that maybe UArray is not byte  
packed for Word8,
which would cause little difference with a UArray containing Int's,  
but someone else (dons @ #ghc)

was able to tell me it _is_ byte packed.

Does anyone know why the Word8 version is not significantly better in  
terms of memory usage?


greetings,

Kenneth

PS: My adventures on trying to tackle the Netflix Prize problem with  
Haskell can be followed at http://boegel.kejo.be.


--

Kenneth Hoste
ELIS - Ghent University
email: kenneth.ho...@elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-23 Thread Bryan O'Sullivan
2009/2/23 Kenneth Hoste kenneth.ho...@ugent.be


 Does anyone know why the Word8 version is not significantly better in terms
 of memory usage?


Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs.

How's that for service? :-)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-23 Thread Don Stewart
bos:
 2009/2/23 Kenneth Hoste kenneth.ho...@ugent.be
  
 
 Does anyone know why the Word8 version is not significantly better in 
 terms
 of memory usage?
 
 
 Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/BUArr.hs.
 
 How's that for service? :-)

UArray or UArr?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-23 Thread Kenneth Hoste


On Feb 23, 2009, at 19:57 , Don Stewart wrote:


bos:

2009/2/23 Kenneth Hoste kenneth.ho...@ugent.be


   Does anyone know why the Word8 version is not significantly  
better in terms

   of memory usage?


Yes, because there's a typo on line 413 of Data/Array/Vector/Prim/ 
BUArr.hs.


How's that for service? :-)


UArray or UArr?


Well, I'm using UArray, but I'm willing to consider other suitable  
containers...

As long as they are memory efficient. :-)

The typical usage of a UArray will be getting all it's contents,
and converting it to a list to easily manipulate (filter, ...).

So, maybe another data type allows me to store the data in a limited  
amount of memory

(which is my main concern now)...

K.

--

Kenneth Hoste
ELIS - Ghent University
email: kenneth.ho...@elis.ugent.be
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] memory-efficient data type for Netflix data - UArray Int Int vs UArray Int Word8

2009-02-23 Thread wren ng thornton

Kenneth Hoste wrote:
Well, I'm using UArray, but I'm willing to consider other suitable 
containers...

As long as they are memory efficient. :-)

The typical usage of a UArray will be getting all it's contents,
and converting it to a list to easily manipulate (filter, ...).

So, maybe another data type allows me to store the data in a limited 
amount of memory

(which is my main concern now)...


Have you considered using spectral (or counting) bloom filters? I know 
there's a non-spectral version available[1], though I haven't looked at 
it to see how easily it could be made to count.



[1] http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter

--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe