There is an assembly XML at mahout/spark/src/main/assembly/dependency-reduced.xml. It lists dependencies that are external to Mahout but required by either the client or the distributed backend executor code.
Guava has recently been removed, but scopt is still used by the client. The following artifacts were also added to the assembly, and I’m not sure why. The assembly is only used with Spark.

    <includes>
      <include>com.github.scopt</include>
      <include>com.tdunning:t-digest</include>
      <include>org.apache.commons:commons-math3</include>
    </includes>

Are these all used? Does anyone know where t-digest and math3 came from?

I’d also like to propose that we create two jars, one for the client and one for the backend executors. There are three configurations we need to work in: Spark alone, yarn-client, and yarn-cluster. All of these modes separate the needs of the client from those of the backend executors, but each has a slightly different way of getting the required classes to each side. I think splitting the dependencies into a client jar and a backend jar will cover all cases, but we’ll have to explain how to launch code in each mode.
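To make the proposal concrete, here is a rough sketch of what the split might look like as two Maven assembly descriptors. The file names and ids (client-deps, backend-deps) are hypothetical. Putting scopt in the client jar follows from the above, but putting t-digest and commons-math3 in the backend jar is only a guess until we know where they are actually used.

```xml
<!-- Hypothetical file: src/main/assembly/client-deps.xml -->
<!-- Dependencies needed only by the client (driver-side CLI parsing). -->
<assembly>
  <id>client-deps</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <unpack>true</unpack>
      <useProjectArtifact>false</useProjectArtifact>
      <includes>
        <include>com.github.scopt</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>

<!-- Hypothetical file: src/main/assembly/backend-deps.xml -->
<!-- Dependencies needed by backend executors. -->
<!-- ASSUMPTION: t-digest and commons-math3 are executor-side; unverified. -->
<assembly>
  <id>backend-deps</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <unpack>true</unpack>
      <useProjectArtifact>false</useProjectArtifact>
      <includes>
        <include>com.tdunning:t-digest</include>
        <include>org.apache.commons:commons-math3</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
```

These are two separate descriptor files shown together. Each would need its own execution of the assembly plugin in the spark module's pom, and any dependency used on both sides would have to be listed in both.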
