[ https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258373#comment-14258373 ]

Pat Ferrel commented on MAHOUT-1636:
------------------------------------

This may not be all that complicated. 

1) We need an all-deps artifact. Strictly speaking, the deps that are already 
in Spark or Scala aren't needed because they are in the environment with 
Spark, but put that aside for the moment.
2) An artifact is already being created with all dependencies, including 
transitive ones, that are needed for all Spark code except the shell. That is 
the job.jar in the spark module.
3) The main problem is that much of this is duplicated in the fat mrlegacy job 
jar.

So it seems that if we drop the mrlegacy job.jar from the driver classpath 
we'd have a better solution. The only remaining issue is that the new job jar 
will contain some things from Spark and Scala that aren't needed. Maybe we can 
exclude some of those with a little exclusion Maven-fu.

BTW, I'd be surprised if the DSL+Mahout shell runs properly without the 
mrlegacy job jar. Try removing it and using something from mrlegacy that has 
transitive external dependencies.
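
As a rough way to see how much the two job jars overlap (point 3 above), here 
is a minimal Python sketch. It relies only on the fact that a jar is a zip 
archive. The function names are made up for illustration, and the jar paths 
are whatever you pass on the command line; nothing here is part of the Mahout 
build.

```python
import sys
import zipfile


def class_entries(jar_path):
    """Return the set of .class entry names in a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def duplicated_classes(jar_a, jar_b):
    """Return the sorted .class entries that appear in both jars."""
    return sorted(class_entries(jar_a) & class_entries(jar_b))


# Hypothetical usage: python jar_overlap.py <spark job jar> <mrlegacy job jar>
if __name__ == "__main__" and len(sys.argv) == 3:
    first, second = sys.argv[1], sys.argv[2]
    dupes = duplicated_classes(first, second)
    total = len(class_entries(first))
    print(f"{len(dupes)} of {total} classes in {first} are also in {second}")
    # Show a small sample so the output stays readable for fat jars.
    for name in dupes[:20]:
        print("  " + name)
```

This only compares entry names, so it says nothing about whether the 
duplicated classes come from the same version of a dependency.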

> Class dependencies for the spark module are put in a job.jar, which is very 
> inefficient
> ---------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1636
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1636
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>             Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all 
> dependencies, including transitive ones. This job.jar is in 
> mahout/spark/target and is included in the classpath when a Spark job is run. 
> This allows dependency classes to be found at runtime, but the job.jar 
> includes a great many things that are not needed and that duplicate classes 
> found in the main mrlegacy job.jar. If the job.jar is removed, drivers will 
> not find needed classes. A better way of including class dependencies needs 
> to be implemented.
> I'm not sure what that better way is, so I am leaving the assembly alone for 
> now. Whoever picks up this Jira will have to remove it after deciding on a 
> better method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
