I compared spark-itemsimilarity to the Hadoop version on sample data 
(8.7 M, 49290 x 139738) using my little 2-machine cluster and got the 
following speedup. 

Platform        Elapsed Time
Mahout Hadoop   0:20:37
Mahout Spark    0:02:19
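
As a quick check on the two timings above, the ratio can be worked out with a 
few lines of Python. This is only a sketch; the helper name to_seconds is made 
up for illustration and is not part of Mahout.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' elapsed-time string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Elapsed times taken from the table above.
hadoop_seconds = to_seconds("0:20:37")  # 1237
spark_seconds = to_seconds("0:02:19")   # 139

# Ratio of Hadoop time to Spark time for this single job.
speedup = hadoop_seconds / spark_seconds
print(f"{speedup:.1f}x")  # prints "8.9x"
```

That ratio covers only the similarity job itself, not the dictionary jobs that 
the Hadoop pipeline needs around it.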

This isn’t quite apples to apples because the Spark version also does all the 
dictionary management, which with Hadoop usually means two extra jobs tacked on 
before and after the main job. I’ve now done the complete pipeline using both 
Hadoop and Spark, and Spark is not only faster: the old Hadoop way required 
keeping track of 10x more intermediate data and connecting up many more jobs to 
get the pipeline working. Now it’s just one job. You don’t need to worry about 
ID translation anymore and you get over 10x faster completion. This is one of 
those times when speed meets ease-of-use. 
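
For anyone unfamiliar with what the dictionary jobs do, here is a rough sketch 
of ID translation in plain Python. It is not Mahout's code; build_dictionary 
and the sample IDs are hypothetical, and it only illustrates the idea of 
mapping external string IDs to contiguous integer indices before the 
similarity job and back again afterwards.

```python
from typing import Dict, Iterable, List, Tuple

def build_dictionary(ids: Iterable[str]) -> Tuple[Dict[str, int], List[str]]:
    """Assign each distinct external ID a contiguous integer index.

    Returns (id -> index, index -> id). Hypothetical helper, not a Mahout API.
    """
    to_index: Dict[str, int] = {}
    to_id: List[str] = []
    for external_id in ids:
        if external_id not in to_index:
            to_index[external_id] = len(to_id)
            to_id.append(external_id)
    return to_index, to_id

# Made-up (user, item) interactions using application-level string IDs.
interactions = [("u1", "iphone"), ("u2", "ipad"), ("u1", "ipad")]

user_index, user_ids = build_dictionary(u for u, _ in interactions)
item_index, item_ids = build_dictionary(i for _, i in interactions)

# "Before" step: translate string IDs to matrix row/column indices.
encoded = [(user_index[u], item_index[i]) for u, i in interactions]

# "After" step: translate indices in the output back to string IDs.
decoded = [(user_ids[u], item_ids[i]) for u, i in encoded]
```

With the Hadoop pipeline, each of those two steps was a separate job with its 
own intermediate data to keep track of.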
