After rummaging through the worker instances I noticed they were using the assembly jar (which I hadn't realized before). Now, instead of shipping the core and mllib jars individually, I just overwrite the assembly jar on the master and push it out with spark-ec2/copy-dir. For posterity, my run script is:

MASTER=r...@ec2-54-224-110-72.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
ASSEMBLY_SRC=spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
ASSEMBLY_DEST=spark-assembly-1.0.1-hadoop1.0.4.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev

# Build the forked Spark modules, then the driver jar.
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,assembly install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package

# Overwrite the stock assembly on the master (note the rename: the cluster
# expects the 1.0.1 file name), ship the driver jar and config, then fan
# the new lib dir out to the workers before submitting.
rsync --progress $SPARK_DIR/assembly/target/scala-2.10/$ASSEMBLY_SRC $MASTER:spark/lib/$ASSEMBLY_DEST
rsync --progress $FORESTRY_DIR/target/$PRIMARY_JAR $MASTER:
rsync --progress $FORESTRY_DIR/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER "spark-ec2/copy-dir --delete /root/spark/lib"
ssh $MASTER "spark/bin/spark-submit --class com.ttforbes.TreeTest --verbose $PRIMARY_JAR"
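To convince myself the new classes actually made it out to the cluster, something like the following works as a sanity check (a sketch, not part of the run script above: "myNewMethod" is a placeholder for whatever method you added to tree.model.Node, and it assumes the stock spark-ec2 layout with the worker list in /root/spark-ec2/slaves):

# Check that the new Node method is really inside the assembly the
# cluster loads ("myNewMethod" is a placeholder for the real name).
ssh $MASTER "cd /tmp && unzip -o /root/spark/lib/$ASSEMBLY_DEST \
    org/apache/spark/mllib/tree/model/Node.class && \
    javap -p org/apache/spark/mllib/tree/model/Node.class | grep myNewMethod"

# Check that the master and every worker ended up with identical jar bits.
ssh $MASTER 'md5sum /root/spark/lib/spark-assembly-*.jar'
ssh $MASTER 'for h in $(cat /root/spark-ec2/slaves); do ssh $h md5sum /root/spark/lib/spark-assembly-*.jar; done'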
On Mon, Aug 4, 2014 at 10:23 AM, Matt Forbes <m...@tellapart.com> wrote:

> I'm trying to run a forked version of mllib where I am experimenting with
> a boosted trees implementation. Here is what I've tried, but I can't seem
> to get it working properly:
>
> *Directory layout:*
>
> src/spark-dev (spark github fork)
>   pom.xml - I've tried changing the version to 1.2 arbitrarily in core and mllib
> src/forestry (test driver)
>   pom.xml - depends on spark-core and spark-mllib with version 1.2
>
> *spark-defaults.conf:*
>
> spark.master spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077
> spark.verbose true
> spark.files.userClassPathFirst false   # I've tried both true and false here
> spark.executor.memory 6G
> spark.jars spark-mllib_2.10-1.2.0-SNAPSHOT.jar,spark-core_2.10-1.2.0-SNAPSHOT.jar,spark-streaming_2.10-1.2.0-SNAPSHOT.jar
>
> *Build and run script:*
>
> MASTER=r...@ec2-54-224-112-117.compute-1.amazonaws.com
> PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
> FORESTRY_DIR=~/src/forestry-main
> SPARK_DIR=~/src/spark-dev
> cd $SPARK_DIR
> mvn -T8 -DskipTests -pl core,mllib,streaming install
> cd $FORESTRY_DIR
> mvn -T8 -DskipTests package
> rsync --progress ~/src/spark-dev/mllib/target/spark-mllib_2.10-1.2.0-SNAPSHOT.jar $MASTER:
> rsync --progress ~/src/spark-dev/core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar $MASTER:
> rsync --progress ~/src/spark-dev/streaming/target/spark-streaming_2.10-1.2.0-SNAPSHOT.jar $MASTER:
> rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
> rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
> ssh $MASTER "spark/bin/spark-submit --class forestry.TreeTest --verbose $PRIMARY_JAR"
>
> In spark-dev/mllib I've added a new class, GradientBoostingTree, which I'm
> referencing from TreeTest in my test driver. The driver pulls some data
> from S3, converts it to LabeledPoint, and then calls
> GradientBoostingTree.train(...) just as with DecisionTree. This is all
> fine until we call examples.map { x => tree.predict(x.features) }, where
> tree is a DecisionTree that I've also modified in my fork. At this point
> the workers blow up because they can't find a new method I've added to
> the tree.model.Node class. My suspicion is that the workers have
> deserialized the DecisionTreeModel against a different version of mllib
> that doesn't have my changes?
>
> Is my setup all wrong? I'm using an EC2 cluster because it is so easy to
> start up and manage; maybe I need to fully distribute my new version of
> Spark to all the workers before starting the job? Is there an easy way to
> do that?