In lieu of the real story from our Spark experts, here's what seems to be
happening. This will affect anyone using broadcast variables or (I think)
accumulators with Spark. I ran into it because I needed broadcasts for several
things while getting the itemsimilarity CLI to work.
We (Mahout) use a non-default but higher-performance serializer called Kryo.
To broadcast an object for read-only access on all Spark nodes, Kryo must be
able to serialize the broadcast object. I assumed that just about any class
extending Serializable would work without telling Kryo anything, but in
practice I had to add the following to the existing Kryo setup:
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import org.apache.spark.serializer.KryoRegistrator
import org.apache.mahout.math.{DenseVector, Matrix, MatrixWritable, Vector, VectorWritable}
// WritableKryoSerializer is Mahout's Writable-backed Kryo serializer
// (from the spark-bindings module)
import org.apache.mahout.sparkbindings.io.WritableKryoSerializer

class MahoutKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) = {
    kryo.addDefaultSerializer(classOf[Vector], new WritableKryoSerializer[Vector, VectorWritable])
    kryo.addDefaultSerializer(classOf[DenseVector], new WritableKryoSerializer[Vector, VectorWritable])
    kryo.addDefaultSerializer(classOf[Matrix], new WritableKryoSerializer[Matrix, MatrixWritable])
    // Taken from the Kryo docs because I need to broadcast a HashBiMap
    kryo.register(classOf[com.google.common.collect.HashBiMap[String, Int]], new JavaSerializer())
  }
}
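For reference, here is roughly how the registrator gets wired into Spark. A
minimal sketch; the registrator's fully qualified class name is my assumption,
so adjust it to wherever MahoutKryoRegistrator actually lives:

import org.apache.spark.{SparkConf, SparkContext}

// Use Kryo instead of Java serialization and point Spark at our registrator
val conf = new SparkConf()
  .setAppName("itemsimilarity")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
    "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
val sc = new SparkContext(conf)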
So it seems that if you are broadcasting anything not covered by this list,
you'll need to register your class too. Note that the JavaSerializer is
Kryo's, not Java's:

import com.esotericsoftware.kryo.serializers.JavaSerializer
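With that registration in place, broadcasting the HashBiMap works as expected.
A minimal sketch, assuming an existing SparkContext sc and an RDD of item IDs
called itemIds (both hypothetical names for illustration):

import com.google.common.collect.HashBiMap

val itemDictionary = HashBiMap.create[String, Int]()
itemDictionary.put("iPad", 0)
itemDictionary.put("iPhone", 1)

// Ship the dictionary once to every node; workers read it via .value
val bcastDict = sc.broadcast(itemDictionary)
val ordinals = itemIds.map(id => bcastDict.value.get(id))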
The itemsimilarity CLI now works on an HDFS + Spark cluster. I had to change
the POMs, add a jar assembly (job.xml), and edit the mahout script, and I will
close three tickets with https://github.com/apache/mahout/pull/22, so if you
can spare some time please take a look at it.
On Jun 26, 2014, at 4:04 PM, Pat Ferrel <[email protected]> wrote:
If I want Spark to broadcast a Guava class do I have to register it with Kryo
even though it extends Serializable?