On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>
> It is sometimes difficult to obtain realistic "Big" data sets.  A 
> Revolution Analytics blog post yesterday
>
>
> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>
> mentioned the competition
>
> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>
> with a very large data set, which may be useful in looking at performance 
> bottlenecks.
>
> You do need to sign up to be able to download the data and you must agree 
> only to use the data for the purposes of the competition and to remove the 
> data once the competition is over.
>

I did download the largest of the data files, which consists of about 350 
million records on 11 variables in CSV format.  The compressed file is 
around 2.6 GB; uncompressed it would be over 22 GB.  Fortunately, the GZip 
package allows sequential access to the compressed file directly, so it 
never needs to be decompressed on disk.
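
For example, a quick pass to count records can work directly on the 
compressed file.  A minimal sketch, assuming GZip.jl's gzopen; the file 
name is only illustrative:

    using GZip

    # Stream through the compressed file one record at a time; the ~22 GB
    # of uncompressed text is never materialized.
    function countrecords(fname)
        io = gzopen(fname)
        readline(io)            # skip the header line
        n = 0
        while !eof(io)
            readline(io)
            n += 1
        end
        close(io)
        n
    end

    countrecords("transactions.csv.gz")   # illustrative file name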

Most of the variables are what I would call categorical (stored as integer 
values) and could be represented as pooled data vectors.  One variable is 
a date and one is a price, which could be stored as an integer value 
(number of cents) or as a Float32.
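
For the price, I am thinking of something along these lines (a sketch; it 
assumes the field is a plain decimal string, and rounds to absorb the 
floating-point conversion error):

    # Convert a decimal price string such as "3.49" to integer cents.
    pricecents(s) = int(100*float(s))

    pricecents("3.49")    # 349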

So the first task would be parsing all those integers and creating a binary 
representation.  This could be done with a relational database, but I think 
that would be overkill for a static table like this.  I have been thinking 
of storing each column as a memory-mapped array in a pooled-data format. 
That is, store only the indices into a table of distinct values, so that 
each column of indices can be represented with the smallest unsigned 
integer type that is large enough for its table size.
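
A rough sketch of what I mean for a single categorical column, assuming 
the table of distinct values (the pool) has already been collected; the 
file name, the Dict remapping, and the choice of Uint32 indices are only 
for illustration:

    # Write one pooled column: indices into the table of distinct values,
    # stored as raw Uint32 in a binary file that can later be memory-mapped.
    function writepooled(fname, codes::Vector{Int}, pool::Vector{Int})
        remap = Dict{Int,Uint32}()      # value => 1-based index into the pool
        for (i, v) in enumerate(pool)
            remap[v] = uint32(i)
        end
        io = open(fname, "w")
        write(io, Uint32[remap[c] for c in codes])
        close(io)
    end

    # Map the column back in without reading it into memory.
    function mmappooled(fname, n)
        io = open(fname, "r")
        mmap_array(Uint32, (n,), io)
    end

A column whose pool has at most 255 distinct values could use Uint8 
instead, one with at most 65535 values Uint16, and so on.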

To work out the storage format I should first determine the number of 
distinct values for each categorical variable.  I was planning on using 
split(readline(gzfilehandle), ","), applying int() to the appropriate 
fields, and storing the values in a Set or perhaps an IntSet.  Does this 
seem like a reasonable way to start?
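
For concreteness, the counting pass would look something like this (the 
file name and column positions are illustrative; an IntSet is only 
appropriate if the codes are reasonably dense non-negative integers, 
otherwise a Set{Int} is safer):

    using GZip

    # One pass over the compressed file, collecting the distinct codes
    # seen in each of the categorical columns.
    function distinctvals(fname, catcols)
        sets = [IntSet() for j in catcols]
        io = gzopen(fname)
        readline(io)                    # skip the header line
        while !eof(io)
            flds = split(chomp(readline(io)), ",")
            for (k, j) in enumerate(catcols)
                push!(sets[k], int(flds[j]))
            end
        end
        close(io)
        [length(s) for s in sets]       # distinct count per column
    end

    distinctvals("transactions.csv.gz", [1,2,3,4])   # illustrative columns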
