I am just starting out playing with Spark on our Hadoop 2.2 cluster, and I have 
a question.

The current way to submit jobs to the cluster is to create fat-jars with sbt 
assembly. This approach works, but I think it is less than optimal in many 
large Hadoop installations:

The way we interact with the cluster is to log into a CLI machine, which is the 
only one authorized to submit jobs. I cannot use the CLI machine as a dev 
environment, since for security reasons the CLI machine and the Hadoop cluster 
are firewalled and cannot reach out to the internet, so sbt and Maven 
dependency resolution do not work there.

So the procedure now is:
- hack code
- sbt assembly
- rsync my spark directory to the CLI machine
- run my job.

The issue is that every iteration requires shuttling the large binary files 
(the fat-jars) back and forth; they are about 120 MB now, which is slow, 
particularly when I am working remotely from home.

I was wondering whether a better solution would be to build normal thin-jars 
of my code, which are very small (less than a MB) and no problem to copy to 
the cluster every time, and to take advantage of the sbt-created lib_managed 
directory to handle dependencies. We already have this directory, which sbt 
keeps populated with all the dependencies the job needs to run. Wouldn't it be 
possible to have the Spark YARN client add all the jars in lib_managed to the 
classpath and distribute them to the workers automatically? They could also be 
cached across Spark invocations; after all, those jars are versioned and 
immutable, with the possible exception of -SNAPSHOT releases. I think this 
would greatly simplify the development procedure and remove the need to mess 
with ADD_JAR and SPARK_CLASSPATH.
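
Just to make concrete what I am imagining, here is an untested sketch of how I 
can approximate it today from the driver with SparkConf.setJars. This is not a 
proposal for the actual YARN client API; the app name and jar paths are made 
up, and it assumes lib_managed is populated (which I believe we get from sbt's 
retrieveManaged := true setting) and visible from the machine running the 
driver:

  import java.io.File
  import org.apache.spark.{SparkConf, SparkContext}

  object ThinJarJob {
    // Recursively collect every jar sbt has copied under lib_managed
    // (the path is just an example).
    def managedJars(root: File = new File("lib_managed")): Seq[String] = {
      def walk(f: File): Seq[File] =
        if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(walk)
        else Seq(f)
      walk(root).map(_.getAbsolutePath).filter(_.endsWith(".jar"))
    }

    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("thin-jar-example")
        // Ship the small thin jar of my code plus all managed
        // dependencies to the executors, instead of one big assembly.
        .setJars("target/scala-2.10/myjob_2.10-0.1.jar" +: managedJars())
      val sc = new SparkContext(conf)
      println(sc.parallelize(1 to 100).sum())
      sc.stop()
    }
  }

But this still means hand-maintaining the jar list in every job, and as far as 
I can tell the jars still get shipped on every run, which is why having the 
YARN client pick up lib_managed (and cache it) automatically seems much nicer.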

What do you think?

Alex 
