Hi, I've updated JIRA <https://issues.apache.org/jira/browse/MAHOUT-375> with a short update on my progress (as a .diff, of course, no theatricals). I've implemented a decent portion of a pure RBM algorithm, albeit not in a distributed fashion. The specific issues that are my bottlenecks now, partly because I've focused more on the algorithm variants than on the code base so far, are:
1. Getting the dataset in. One way, I assume, is to read the dataset from a file using FileDataModel on this <https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv> (clicking in Chrome downloads a multi-GB file, but Safari lets you see the first few lines of the csv without downloading), which is a csv-formatted version of the Netflix dataset. But is it generic enough to expect csv input for a recommender beyond the Netflix dataset? I looked at GenericDataModel, but the comments state that it is useful only for small experiments and not recommended where performance matters. If anybody could give some pointers on how to use DataModels in Taste, and why they are not used in the other (non-o.a.m.cf) parts, it would be really helpful. (A rough sketch of what I have in mind is the first snippet below.)

2. An ancillary question to the above: can the Python scripts that did the above formatting of the Netflix dataset be added to a patch in the examples section, alongside the Wikipedia and Reuters examples? Or is Python not allowed? Is the policy something like: anything that can be done with bash is fine, and the rest is subject to discussion?

3. Are there any useful Vector types, iterators, or Matrix types (sparse, in the case of the Netflix dataset) that you think could be useful here (in o.a.m.cf.taste)? I found some, but I would definitely love pointers to specific classes, just to make sure I don't end up rewriting things out of ignorance. :) (The second snippet below shows how I was planning to hold a user's ratings.)

4. Now, the issue of map-reducing the pure RBM and putting the algorithm's data structures on HDFS. I'm positive the instance data structures of the pure RBM will not fit into memory for the Netflix dataset (at least, Weka can't load it, and even naive back-of-the-envelope calculations of the worst-case space agree), so I think it's safer to put them on HDFS. Could you give me some pointers on how to get started on this? (Thanks Jake and Ted; I think I need the sort of operation Jake described above, where I can call a function f on each vector of the whole matrix (the dataset here, which is sparse) in a distributed fashion. A toy sketch of how I picture that is the last snippet below.) I'll look at this in detail tomorrow, but any other pointers on this issue, with reference to the MAHOUT-375.diff update, are very welcome.

I'm sure these are naive questions, so apologies. :) I desperately want to get past these hoops once this kicking-the-tyres phase is over.

Thanks once again. It's good night here, but good day to everyone else!

-- Sisir
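For question 1, here is a minimal sketch of how I was imagining the read path, assuming FileDataModel is happy with the comma-separated userID,itemID,rating lines in the file above (the local file name and the RBM hook are just placeholders on my side):

import java.io.File;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class NetflixCsvLoad {
  public static void main(String[] args) throws Exception {
    // each line of the csv is userID,itemID,rating (dates stripped out)
    DataModel model = new FileDataModel(new File("netflix-dataset-wodates.csv"));
    System.out.println(model.getNumUsers() + " users, " + model.getNumItems() + " items");

    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      PreferenceArray prefs = model.getPreferencesFromUser(userID);
      // prefs would become the visible layer of the RBM for this user
    }
  }
}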
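For question 3, this is roughly how I was planning to hold one user's ratings as a sparse vector with the o.a.m.math classes, assuming item IDs can be used directly as indices (they run from 1 in the Netflix data, hence the +1 on the cardinality); please correct me if there is a better-suited type:

import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class RatingsVector {
  // Netflix has ~17770 items; +1 because item IDs start at 1 and are used as indices here
  static Vector toSparseVector(PreferenceArray prefs, int numItems) {
    Vector v = new RandomAccessSparseVector(numItems + 1);
    for (Preference p : prefs) {
      v.setQuick((int) p.getItemID(), p.getValue());
    }
    return v;
  }
}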
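Finally, for question 4, here is a toy sketch of how I currently picture the first step of the Jake-style operation as a plain Hadoop mapper: re-key each csv line by user so a reducer sees one whole sparse row of the matrix at a time and can apply f(row) there. The class name, key/value choices, and the reducer-side f are assumptions on my part, not anything in the patch yet:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Re-keys "userID,itemID,rating" lines by userID so a reducer can assemble
// the sparse row for each user and run the per-row RBM update f(row) there.
public class RatingsByUserMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(",");
    long userID = Long.parseLong(tokens[0]);
    context.write(new LongWritable(userID),
                  new Text(tokens[1] + "," + tokens[2]));
  }
}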
