[ 
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258621#comment-14258621
 ] 

Pat Ferrel commented on MAHOUT-1636:
------------------------------------

I think the job.jar is just an assembly of other jars as specified in job.xml, 
and it could be called anything. AFAIK there is nothing specific about the 
format, and the JVM certainly does recognize and load classes from a job.jar 
in the front-end part of the driver.
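
For anyone who wants to check this on their own classpath, here is a rough 
sketch (not something in Mahout; the WhichJar class and locationOf helper are 
made-up names for illustration) that prints which jar or directory a class was 
actually loaded from, using the standard ProtectionDomain/CodeSource API:

```java
import java.security.CodeSource;

public class WhichJar {
    // Hypothetical helper: returns the location (jar or directory) a class
    // was loaded from, or a placeholder when the JVM reports no code source
    // (as is typical for core JDK classes).
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "<bootstrap/unknown>";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // The class name comes from the command line; defaults to this class.
        String name = args.length > 0 ? args[0] : WhichJar.class.getName();
        System.out.println(name + " -> " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath the driver uses and pass a fully qualified class 
name as the argument. This only reports what the front-end JVM sees; it says 
nothing about how classes are resolved on the backend.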

Can you point me to something that shows how the backend classes are loaded 
differently?

The shell depends on the spark module, which depends on mrlegacy, and since the 
only place transitive dependencies are assembled is the mrlegacy job jar, I 
suspect the DSL+shell will have holes if you do away with the job jar.

Seems like we have two cases: Hadoop MapReduce, which is covered, and Spark, 
which does need an all-deps jar (minus HDFS, Spark, and Scala). This means at 
least two classpaths, which we have. We are missing a Spark assembly that we 
can agree on. I'm terrible at POMs but will see if I can figure out a way to 
exclude HDFS, Spark, and Scala from the current job jar in the spark module. 
This should get us most of the way to an agreeable solution.

> Class dependencies for the spark module are put in a job.jar, which is very 
> inefficient
> ---------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1636
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1636
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all 
> dependencies including transitive ones. This job.jar is in 
> mahout/spark/target and is included in the classpath when a Spark job is run. 
> This allows dependency classes to be found at runtime, but the job.jar 
> includes a great deal of things not needed that are duplicates of classes 
> found in the main mrlegacy job.jar. If the job.jar is removed, drivers will 
> not find needed classes. A better way needs to be implemented for including 
> class dependencies.
> I'm not sure what that better way is so am leaving the assembly alone for 
> now. Whoever picks up this Jira will have to remove it after deciding on a 
> better method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
