I am just starting out playing with Spark on our Hadoop 2.2 cluster and I have a question.
The current way we submit jobs to the cluster is to build fat-jars with sbt assembly. This works, but I think it is less than optimal in many large Hadoop installations: the way we interact with the cluster is to log into a CLI machine, which is the only one authorized to submit jobs. I cannot use the CLI machine as a dev environment because, for security reasons, the CLI machine and the Hadoop cluster are firewalled and cannot reach out to the internet, so sbt and Maven dependency resolution do not work there. So the procedure now is:

- hack code
- sbt assembly
- rsync my Spark directory to the CLI machine
- run my job

The issue is that every time I need to shuttle large binary files (all the fat-jars) back and forth; they are about 120 MB now, which is slow, particularly when I am working remotely from home.

I was wondering whether a better solution would be to build normal thin-jars of my code, which are very small (less than a MB) and no problem to copy to the cluster every time, and to take advantage of the sbt-created lib_managed directory to handle dependencies. We already have this directory, which sbt maintains with all the dependencies the job needs to run. Wouldn't it be possible to have the Spark YARN client add all the jars in lib_managed to the classpath and distribute them to the workers automatically? They could also be cached across invocations of Spark; after all, those jars are versioned and immutable, with the possible exception of -SNAPSHOT releases. (A rough sketch of what I mean is in the P.S. below.)

I think this would greatly simplify the development procedure and remove the need to mess with ADD_JAR and SPARK_CLASSPATH. What do you think?

Alex
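P.S. To make the idea a bit more concrete, here is a minimal sketch of the kind of thing I end up doing by hand today, using plain SparkConf.setJars. The jar name, the Scala version in the path and the lib_managed location are just placeholders for my own setup, not anything Spark provides. The proposal is essentially that the YARN client could do this walk over lib_managed for me, ship the jars once and cache them on the cluster across invocations:

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

object ThinJarJob {
  // Recursively collect every jar that sbt has already resolved into lib_managed.
  def libManagedJars(root: File): Seq[String] =
    Option(root.listFiles).toSeq.flatten.flatMap { f =>
      if (f.isDirectory) libManagedJars(f)
      else if (f.getName.endsWith(".jar")) Seq(f.getAbsolutePath)
      else Seq.empty
    }

  def main(args: Array[String]): Unit = {
    // My own thin jar (placeholder path) plus all the dependency jars
    // that sbt keeps in lib_managed.
    val jars = "target/scala-2.10/myjob_2.10-0.1.jar" +:
      libManagedJars(new File("lib_managed"))

    // setJars ships the listed jars to the executors -- exactly the step
    // I would like the YARN client to handle (and cache) automatically.
    val conf = new SparkConf()
      .setAppName("thin-jar-example")
      .setJars(jars)

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).count())
    sc.stop()
  }
}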