In lieu of the real story from our Spark experts, here's what seems to be 
happening. This will affect anyone using broadcast variables or (I think) 
accumulators with Spark. I ran into both while getting the CLI for 
itemsimilarity to work.

We (Mahout) use a non-default but faster serializer called Kryo. In order to 
broadcast an object for read-only access on all Spark nodes, Kryo must be able 
to serialize that object. I assumed just about any class that extends 
Serializable would work without telling Kryo anything, but in practice I had 
to add the following to the existing Kryo setup:

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import org.apache.spark.serializer.KryoRegistrator
import org.apache.mahout.math.{DenseVector, Matrix, MatrixWritable, Vector, VectorWritable}

class MahoutKryoRegistrator extends KryoRegistrator {

  override def registerClasses(kryo: Kryo) = {

    // Mahout's vector/matrix types serialize through their Writable forms
    // (WritableKryoSerializer comes with Mahout's Spark bindings)
    kryo.addDefaultSerializer(classOf[Vector], new WritableKryoSerializer[Vector, VectorWritable])
    kryo.addDefaultSerializer(classOf[DenseVector], new WritableKryoSerializer[Vector, VectorWritable])
    kryo.addDefaultSerializer(classOf[Matrix], new WritableKryoSerializer[Matrix, MatrixWritable])

    // taking the following line from the Kryo docs because I need to broadcast a HashBiMap
    kryo.register(classOf[com.google.common.collect.HashBiMap[String, Int]], new JavaSerializer())
  }
}
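
For context, none of this kicks in unless Spark is actually pointed at Kryo 
and at this registrator. A minimal sketch of the wiring, assuming the 
registrator ends up at the fully qualified name shown (substitute whatever 
package you actually put it in):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("itemsimilarity")
  // tell Spark to serialize with Kryo instead of plain Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // package name here is an assumption -- use your registrator's actual FQCN
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
val sc = new SparkContext(conf)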

So it seems that if you are broadcasting anything not in this list, you'll 
need to register your class. The JavaSerializer comes from Kryo, so:

import com.esotericsoftware.kryo.serializers.JavaSerializer
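
To make that concrete, here's roughly what registering your own class looks 
like; MyParams is a made-up stand-in for whatever you're broadcasting:

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer

// hypothetical class we want read-only on every node
case class MyParams(threshold: Double, labels: Map[Int, String])

// add this inside registerClasses(kryo: Kryo) in your registrator;
// JavaSerializer falls back to plain Java serialization, so the class
// just has to be Serializable (case classes already are)
kryo.register(classOf[MyParams], new JavaSerializer())

// after that, broadcasting works as usual:
// val paramsB = sc.broadcast(MyParams(0.5, Map(1 -> "a")))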

The itemsimilarity CLI now works on an HDFS + Spark cluster. I had to change 
the POMs, add a jar assembly/job.xml, and edit the mahout script, and I will 
close three tickets with https://github.com/apache/mahout/pull/22, so if you 
can spare some time please take a look at it.


On Jun 26, 2014, at 4:04 PM, Pat Ferrel <[email protected]> wrote:

If I want Spark to broadcast a Guava class, do I have to register it with Kryo 
even though it extends Serializable?
