[
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262632#comment-14262632
]
Andrew Palumbo commented on MAHOUT-1636:
----------------------------------------
As far as the build goes, I have to study up a bit; I am far from a build
engineer. One thing I'm wondering about, not so much the build itself but the
project structure going forward, is whether or not we need the MR-Legacy
dependency in the Spark module. I know the idea has been floated a few
times, by at least you and Dmitriy, of having a `scala-core` module and
something like an "all things hadoop" or "java-hadoop" (if I remember
correctly) module. So what about something like `scala-core` and `java-core`
modules?
As far as I can tell (at compile time, at least), the only Java classes the
`Spark` module needs from `MR-Legacy` are `MatrixWritable`, `VectorWritable`,
`Pair`, and `IOUtils`. So we should be able to move these out of `MR-Legacy` and
into `java-core` easily. I think this would make sense anyway, since these
classes are not really "legacy" but actively used.
This way we can have the `$MAHOUT classpath -spark` invocation skip over the
`MR-Legacy` dependencies and pick up jars from a slimmed-down `java-core`
module instead.
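As a rough sketch, the Spark module's POM could then depend on the slimmed-down module rather than MR-Legacy (the `mahout-java-core` artifactId here is hypothetical, just illustrating the proposed split; the actual coordinates would be decided when the module is created):

```xml
<!-- spark/pom.xml (sketch): replace the mrlegacy dependency with a
     hypothetical slimmed-down java-core module that would hold
     MatrixWritable, VectorWritable, Pair and IOUtils -->
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-java-core</artifactId>
  <version>${project.version}</version>
</dependency>
```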
[~pferrel] - I just saw that you sent an email out. I'll post this here
anyway, since it's more about directory structure than the missing /lib and
the maven work that we need to do. I'll respond there after I get a better
understanding of what's going on with that.
> Class dependencies for the spark module are put in a job.jar, which is very
> inefficient
> ---------------------------------------------------------------------------------------
>
> Key: MAHOUT-1636
> URL: https://issues.apache.org/jira/browse/MAHOUT-1636
> Project: Mahout
> Issue Type: Bug
> Components: spark
> Affects Versions: 1.0-snapshot
> Reporter: Pat Ferrel
> Assignee: Ted Dunning
> Fix For: 1.0-snapshot
>
>
> Using a maven plugin and an assembly job.xml, a job.jar is created with all
> dependencies, including transitive ones. This job.jar is in
> mahout/spark/target and is included in the classpath when a Spark job is run.
> This allows dependency classes to be found at runtime, but the job.jar
> includes a great deal of things that are not needed and that duplicate
> classes found in the main mrlegacy job.jar. If the job.jar is removed,
> drivers will not find needed classes. A better way of including class
> dependencies needs to be implemented.
> I'm not sure what that better way is, so I am leaving the assembly alone for
> now. Whoever picks up this Jira will have to remove it after deciding on a
> better method.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)