Hi, I've updated JIRA <https://issues.apache.org/jira/browse/MAHOUT-375> with a short update on my progress (as a .diff, of course, no theatricals). I've implemented a decent portion of a pure RBM algorithm, albeit not in a distributed fashion. The specific issues that are my bottlenecks now, partly because I've focused more on the algorithm variants than on the code base so far, are:
1. Getting the dataset in. One way, I assume, is to read the dataset from a file using FileDataModel on this <https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv> (clicking in Chrome downloads a multi-GB file, but Safari lets you see the first few lines of the csv without downloading), which is a csv-formatted version of the Netflix dataset. But is it generic enough to expect csv input for a recommender beyond the Netflix dataset? I looked at GenericDataModel, but the comments state that it is useful only for small experiments and not recommended where performance matters. If anybody could give some pointers on how to use DataModels in Taste, and why they are not used in the other (non-o.a.m.cf) parts, it would be really helpful. (A rough sketch of what I have in mind is the first snippet below.)

2. An ancillary question to the above: can the Python scripts that did the above formatting of the Netflix dataset be added to a patch in the examples section, alongside the Wikipedia and Reuters examples? Or is Python not allowed? Is the policy something like: anything that can be done with bash is fine, and the rest is subject to discussion?

3. Are there any useful Vector types, iterators, or Matrix types (sparse, in the case of the Netflix dataset) that you think could be useful here (in o.a.m.cf.taste)? I found some, but I would definitely love pointers to specific classes, just to make sure I don't end up rewriting things out of ignorance. :) (The second snippet below shows how I was planning to hold a user's ratings.)

4. Now, the issue of map-reducing the pure RBM and putting the algorithm's data structures on HDFS. I'm positive the instance data structures of the pure RBM will not fit into memory for the Netflix dataset (at least, Weka can't load it, and even naive back-of-the-envelope calculations of the worst-case space agree), so I think it's safer to put them on HDFS. Could you give me some pointers on how to get started on this? (Thanks Jake and Ted; I think I need the sort of operation Jake described above, where I can call a function f on each vector of the whole matrix (the dataset here, which is sparse) in a distributed fashion. A toy sketch of how I picture that is the last snippet below.) I'll look at this in detail tomorrow, but any other pointers on this issue, with reference to the MAHOUT-375.diff update, are very welcome.

I'm sure these are naive questions, so apologies. :) I desperately want to get past these hoops once this kicking-the-tyres phase is over.

Thanks once again. It's good night here, but good day to everyone else!

-- Sisir
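For question 1, here is a minimal sketch of how I was imagining the read path, assuming FileDataModel is happy with the comma-separated userID,itemID,rating lines in the file above (the local file name and the RBM hook are just placeholders on my side):

import java.io.File;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class NetflixCsvLoad {
  public static void main(String[] args) throws Exception {
    // each line of the csv is userID,itemID,rating (dates stripped out)
    DataModel model = new FileDataModel(new File("netflix-dataset-wodates.csv"));
    System.out.println(model.getNumUsers() + " users, " + model.getNumItems() + " items");

    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      PreferenceArray prefs = model.getPreferencesFromUser(userID);
      // prefs would become the visible layer of the RBM for this user
    }
  }
}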
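For question 3, this is roughly how I was planning to hold one user's ratings as a sparse vector with the o.a.m.math classes, assuming item IDs can be used directly as indices (they run from 1 in the Netflix data, hence the +1 on the cardinality); please correct me if there is a better-suited type:

import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class RatingsVector {
  // Netflix has ~17770 items; +1 because item IDs start at 1 and are used as indices here
  static Vector toSparseVector(PreferenceArray prefs, int numItems) {
    Vector v = new RandomAccessSparseVector(numItems + 1);
    for (Preference p : prefs) {
      v.setQuick((int) p.getItemID(), p.getValue());
    }
    return v;
  }
}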
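Finally, for question 4, here is a toy sketch of how I currently picture the first step of the Jake-style operation as a plain Hadoop mapper: re-key each csv line by user so a reducer sees one whole sparse row of the matrix at a time and can apply f(row) there. The class name, key/value choices, and the reducer-side f are assumptions on my part, not anything in the patch yet:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Re-keys "userID,itemID,rating" lines by userID so a reducer can assemble
// the sparse row for each user and run the per-row RBM update f(row) there.
public class RatingsByUserMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(",");
    long userID = Long.parseLong(tokens[0]);
    context.write(new LongWritable(userID),
                  new Text(tokens[1] + "," + tokens[2]));
  }
}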
