[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
-------------------------------

    Attachment: NetflixRecomenderEvaluatorRunner.java

Runnable component for testing ParallelSGDFactorizer on the Netflix training 
dataset (yes, only the training set generated by NetflixDatasetConverter; I 
could not get judging.txt for validation, but my purpose is just to test its 
efficiency at extreme scale, so that will do).

Warning! To run it safely you need to allocate at least 12G of heap space to 
the JVM by using the following VM parameters:

-Xms12288M -Xmx12288M
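
As a rough illustration of the heap requirement above, the runner could check the available heap before loading the dataset. This is only a sketch: the class name HeapCheck, the isEnough helper, and the 10% tolerance are assumptions for illustration and are not part of the attached runner. The tolerance is there because Runtime.maxMemory() can report somewhat less than the -Xmx value, depending on the garbage collector.

```java
/** Hypothetical pre-flight check; not part of NetflixRecomenderEvaluatorRunner. */
final class HeapCheck {

    // 12288M, matching the -Xmx value recommended above.
    static final long REQUIRED_BYTES = 12288L * 1024 * 1024;

    // Assumed tolerance: accept a reported maximum up to 10% below the target.
    static final double TOLERANCE = 0.10;

    /** Returns true if the reported maximum heap is close enough to the requirement. */
    static boolean isEnough(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= (long) (requiredBytes * (1.0 - TOLERANCE));
    }

    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();
        long maxMb = max / (1024 * 1024);
        if (!isEnough(max, REQUIRED_BYTES)) {
            System.err.println("Max heap is only " + maxMb
                + "M; restart with -Xms12288M -Xmx12288M");
            System.exit(1);
        }
        System.out.println("Max heap is " + maxMb + "M; proceeding");
    }
}
```

Failing fast like this is cheaper than finding out halfway through loading that the JVM is thrashing in garbage collection.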

In addition, 16G+ of RAM is MANDATORY; otherwise garbage collection or swapping 
(or both) will kill you. I almost burned out my laptop on this (it has only 8G 
of RAM). As a result, I won't be able to post any results until I can get a 
better machine. But since its number of ratings is about 6 times that of the 
movielens-10m or libimseti dataset, and SGD scales linearly with this number, I 
estimate the running time to be between 2.5 and 3 minutes.
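
The arithmetic behind that estimate can be written out as a small sketch. The class and method names are hypothetical, and the 25-30 second baseline is not a measured number: it is only what the 2.5-3 minute estimate implies when divided by the factor of 6, assuming running time is strictly proportional to the number of ratings.

```java
/** Hypothetical helper illustrating the linear-scaling estimate. */
final class RuntimeEstimate {

    /** Linear scaling: time grows in proportion to the number of ratings. */
    static double scaledSeconds(double baselineSeconds, double ratingsRatio) {
        return baselineSeconds * ratingsRatio;
    }

    /** Inverse: the baseline implied by an estimate at the larger scale. */
    static double impliedBaselineSeconds(double estimateSeconds, double ratingsRatio) {
        return estimateSeconds / ratingsRatio;
    }

    public static void main(String[] args) {
        double ratio = 6.0; // Netflix ratings vs. the smaller datasets, per the text
        System.out.println("Implied baseline (s): "
            + impliedBaselineSeconds(150.0, ratio) + " to "
            + impliedBaselineSeconds(180.0, ratio));
        System.out.println("Scaled estimate (s): "
            + scaledSeconds(25.0, ratio) + " to " + scaledSeconds(30.0, ratio));
    }
}
```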

I would be most obliged to anybody who can try it and post the results here (if 
your machine can handle it, of course). But obviously, as Sebastian has pointed 
out, our FileDataModel needs some serious optimization to handle such scale.

Hey Sebastian, can you try this out in your lab? That would be most helpful.
                
> Parallel SGD matrix factorizer for SVDrecommender
> -------------------------------------------------
>
>                 Key: MAHOUT-1272
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>              Labels: features, patch, test
>             Fix For: 0.8
>
>         Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
> libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
> NetflixRecomenderEvaluatorRunner.java, ParallelSGDFactorizer.java, 
> ParallelSGDFactorizer.java, ParallelSGDFactorizerTest.java, 
> ParallelSGDFactorizerTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> A parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processors.
> The existing code is single-threaded and may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch conversion prevents it from being 
> used by an online recommender. An online SGD implementation may help build a 
> high-performance online recommender as a replacement for the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> Hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been going on for a while but remains inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl
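
For readers unfamiliar with the Hogwild! idea mentioned in the description, here is a minimal sketch of lock-free parallel SGD for matrix factorization. It is not the attached ParallelSGDFactorizer and does not use any Mahout API; the class HogwildSketch, its fields, and the hyperparameters are all assumptions made for illustration. Threads update the shared feature vectors without any locking, accepting occasional lost updates, and the sketch walks the ratings in index order rather than sampling them randomly.

```java
import java.util.Random;
import java.util.stream.IntStream;

/** Hypothetical Hogwild!-style sketch; not Mahout's ParallelSGDFactorizer. */
final class HogwildSketch {
    final double[][] userF; // one feature vector per user, shared by all threads
    final double[][] itemF; // one feature vector per item, shared by all threads
    final int k;            // number of latent features

    HogwildSketch(int users, int items, int k, long seed) {
        this.k = k;
        Random rnd = new Random(seed);
        userF = randomMatrix(users, k, rnd);
        itemF = randomMatrix(items, k, rnd);
    }

    private static double[][] randomMatrix(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int f = 0; f < cols; f++) row[f] = 0.1 * rnd.nextDouble();
        }
        return m;
    }

    double predict(int u, int i) {
        double dot = 0;
        for (int f = 0; f < k; f++) dot += userF[u][f] * itemF[i][f];
        return dot;
    }

    /** One pass over all ratings; updates are applied in parallel without locks. */
    void epoch(int[] users, int[] items, double[] ratings, double lr, double reg) {
        IntStream.range(0, ratings.length).parallel().forEach(n -> {
            int u = users[n];
            int i = items[n];
            double err = ratings[n] - predict(u, i);
            for (int f = 0; f < k; f++) {
                double uf = userF[u][f];
                double itf = itemF[i][f];
                // Unsynchronized writes: concurrent updates may overwrite each other.
                userF[u][f] += lr * (err * itf - reg * uf);
                itemF[i][f] += lr * (err * uf - reg * itf);
            }
        });
    }

    /** Root-mean-square error over the given ratings. */
    double rmse(int[] users, int[] items, double[] ratings) {
        double sum = 0;
        for (int n = 0; n < ratings.length; n++) {
            double e = ratings[n] - predict(users[n], items[n]);
            sum += e * e;
        }
        return Math.sqrt(sum / ratings.length);
    }
}
```

DSGD takes the opposite approach: it partitions the rating matrix into blocks that share no users or items, so that threads never touch the same vectors at the same time.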

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
