I compared spark-itemsimilarity to the Hadoop version on sample data (8.7M, a 49,290 x 139,738 matrix) using my little two-machine cluster and got the following speedup.
    Platform        Elapsed Time
    Mahout Hadoop   0:20:37
    Mahout Spark    0:02:19

This isn't quite apples to apples because the Spark version does all of the dictionary management, which is usually two extra jobs tacked on before and after the Hadoop job. Having run the complete pipeline on both Hadoop and Spark, I can say the Spark version is not only faster; the old Hadoop way also required keeping track of roughly 10x more intermediate data and wiring up many more jobs to get the pipeline working. Now it's just one job. You don't need to worry about ID translation anymore and you get over 10x faster completion. This is one of those times when speed meets ease-of-use.
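For reference, that "one job" is a single command line. Here is a minimal sketch of an invocation; the input/output paths and the Spark master URL are placeholders, and available flags can vary by Mahout version, so check `mahout spark-itemsimilarity --help` for your build.

    # Single Spark job: reads raw interaction data with your application's own IDs,
    # handles the dictionary/ID translation internally, and writes item-item
    # indicators back out using those original IDs.
    mahout spark-itemsimilarity \
      --input hdfs://namenode:8020/data/interactions \
      --output hdfs://namenode:8020/data/indicators \
      --master spark://spark-master:7077

Compare that with the Hadoop pipeline, where the ID-to-integer dictionary job, the itemsimilarity job itself, and the reverse translation job all had to be chained together by hand.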
