After rummaging through the worker instances I noticed they were loading the
assembly jar (which I hadn't realized before). So now, instead of shipping the
core and mllib jars individually, I just overwrite the assembly jar on the
master and push it to the workers with spark-ec2/copy-dir. For posterity, my
run script is:

MASTER=r...@ec2-54-224-110-72.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
ASSEMBLY_SRC=spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
ASSEMBLY_DEST=spark-assembly-1.0.1-hadoop1.0.4.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,assembly install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress $SPARK_DIR/assembly/target/scala-2.10/$ASSEMBLY_SRC \
  $MASTER:spark/lib/$ASSEMBLY_DEST
rsync --progress $FORESTRY_DIR/target/$PRIMARY_JAR $MASTER:
rsync --progress $FORESTRY_DIR/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER "spark-ec2/copy-dir --delete /root/spark/lib"
# spark-submit options must come before the application jar
ssh $MASTER "spark/bin/spark-submit --class com.ttforbes.TreeTest --verbose $PRIMARY_JAR"
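
One way to double-check that copy-dir actually propagated the new assembly is
to compare checksums across the cluster. A rough sketch, assuming the standard
spark-ec2 layout with the slave hostnames listed in /root/spark-ec2/slaves:

# Sketch: compare the assembly jar's checksum on the master and each worker.
# The slave list location is an assumption based on spark-ec2 defaults.
ssh $MASTER 'md5sum spark/lib/spark-assembly-*.jar;
  for h in $(cat /root/spark-ec2/slaves); do
    ssh -o StrictHostKeyChecking=no $h "md5sum spark/lib/spark-assembly-*.jar";
  done'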



On Mon, Aug 4, 2014 at 10:23 AM, Matt Forbes <m...@tellapart.com> wrote:

> I'm trying to run a forked version of mllib where I am experimenting with
> a boosted trees implementation. Here is what I've tried, but I can't seem
> to get it working properly:
>
> *Directory layout:*
>
> src/spark-dev  (spark github fork)
>   pom.xml - I've tried changing the version to 1.2 arbitrarily in core and mllib
> src/forestry  (test driver)
>   pom.xml - depends on spark-core and spark-mllib with version 1.2
>
> *spark-defaults.conf:*
>
> spark.master                    spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077
> spark.verbose                   true
> spark.files.userClassPathFirst  false   # I've tried both true and false here
> spark.executor.memory           6G
> spark.jars                      spark-mllib_2.10-1.2.0-SNAPSHOT.jar,spark-core_2.10-1.2.0-SNAPSHOT.jar,spark-streaming_2.10-1.2.0-SNAPSHOT.jar
>
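(A quick way to confirm a conf like this is even being picked up: spark-submit's
--verbose flag prints the properties file it loaded and the resolved values
before the app starts. A minimal sketch, reusing the jar and class names from
the script below:)

# Sketch: have spark-submit echo the properties file it reads and the
# resolved Spark properties; handy for confirming spark-defaults.conf
# is being consulted at all.
spark/bin/spark-submit --verbose --class forestry.TreeTest $PRIMARY_JAR
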
> *Build and run script:*
>
> MASTER=r...@ec2-54-224-112-117.compute-1.amazonaws.com
> PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
> FORESTRY_DIR=~/src/forestry-main
> SPARK_DIR=~/src/spark-dev
> cd $SPARK_DIR
> mvn -T8 -DskipTests -pl core,mllib,streaming install
> cd $FORESTRY_DIR
> mvn -T8 -DskipTests package
> rsync --progress $SPARK_DIR/mllib/target/spark-mllib_2.10-1.2.0-SNAPSHOT.jar $MASTER:
> rsync --progress $SPARK_DIR/core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar $MASTER:
> rsync --progress $SPARK_DIR/streaming/target/spark-streaming_2.10-1.2.0-SNAPSHOT.jar $MASTER:
> rsync --progress $FORESTRY_DIR/target/$PRIMARY_JAR $MASTER:
> rsync --progress $FORESTRY_DIR/spark-defaults.conf $MASTER:spark/conf
> ssh $MASTER "spark/bin/spark-submit --class forestry.TreeTest --verbose $PRIMARY_JAR"
>
> In spark-dev/mllib I've added a new class, GradientBoostingTree, which I'm
> referencing from TreeTest in my test driver. The driver pulls some data
> from s3, converts to LabeledPoint, and then calls
> GradientBoostingTree.train(...) identically to how DecisionTree works. This
> is all fine until we call examples.map { x => tree.predict(x.features) },
> where tree is a DecisionTree that I've also modified in my fork. At this
> point, the workers blow up because they can't find a new method I've added
> to the tree.model.Node class. My suspicion is that the workers are
> deserializing the DecisionTreeModel against a different version of mllib,
> one that doesn't have my changes.
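
(One way to test that suspicion is to inspect the class the executors actually
load. A minimal sketch, assuming the spark-ec2 layout, with WORKER_HOST
standing in for any slave; javap lists the methods compiled into whatever jar
the worker has:)

# Sketch: dump the methods of the Node class from the jar on a worker.
# WORKER_HOST is a placeholder; the jar path assumes the spark-ec2 layout.
ssh $WORKER_HOST 'javap -classpath $(ls spark/lib/spark-assembly-*.jar | head -1) \
  org.apache.spark.mllib.tree.model.Node'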
>
> Is my setup all wrong? I'm using an EC2 cluster because it is so easy to
> start up and manage, but maybe I need to fully distribute my new version of
> Spark to all the workers before starting the job? Is there an easy way to
> do that?
