[
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262632#comment-14262632
]
Andrew Palumbo commented on MAHOUT-1636:
----------------------------------------
As far as the build goes, I have to study up a bit; I am far from a build
engineer. One thing I'm wondering about, not so much the build itself but the
project structure going forward, is whether or not we need the MR-Legacy
dependency in the Spark module. I know the idea has been floated a few
times, by at least you and Dmitriy, of having a `scala-core` module and
something like an "all things hadoop" or "java-hadoop" (if I remember
correctly) module. So what about something like `scala-core` and `java-core`
modules?
As far as I can tell (at compile time, at least), the only Java classes the
`Spark` module needs from `MR-Legacy` are `MatrixWritable`, `VectorWritable`,
`Pair`, and `IOUtils`. So we should be able to move these out of `MR-Legacy` and
into `java-core` easily. I think this would make sense anyway, since these
classes are not really "legacy" but actively used.
This way we can have the `$MAHOUT classpath -spark` invocation skip over the
`MR-Legacy` dependencies and pick up jars from a slimmed-down `java-core`
module instead.
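As a rough sketch, the Spark module's POM could then depend on the slimmed-down module rather than MR-Legacy (the `mahout-java-core` artifactId here is hypothetical, just illustrating the proposed split; the actual coordinates would be decided when the module is created):

```xml
<!-- spark/pom.xml (sketch): replace the mrlegacy dependency with a
     hypothetical slimmed-down java-core module that would hold
     MatrixWritable, VectorWritable, Pair and IOUtils -->
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-java-core</artifactId>
  <version>${project.version}</version>
</dependency>
```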
[~pferrel] - I just saw that you sent an email out. I'll post this here
anyway, since it's more about directory structure than the missing /lib and
the maven work that we need to do. I'll respond there after I get a better
understanding of what's going on with that.
> Class dependencies for the spark module are put in a job.jar, which is very
> inefficient
> ---------------------------------------------------------------------------------------
>
> Key: MAHOUT-1636
> URL: https://issues.apache.org/jira/browse/MAHOUT-1636
> Project: Mahout
> Issue Type: Bug
> Components: spark
> Affects Versions: 1.0-snapshot
> Reporter: Pat Ferrel
> Assignee: Ted Dunning
> Fix For: 1.0-snapshot
>
>
> Using a maven plugin and an assembly job.xml, a job.jar is created with all
> dependencies, including transitive ones. This job.jar is in
> mahout/spark/target and is included in the classpath when a Spark job is run.
> This allows dependency classes to be found at runtime, but the job.jar
> includes a great deal of things that are not needed and that duplicate
> classes found in the main mrlegacy job.jar. If the job.jar is removed,
> drivers will not find needed classes. A better way of including class
> dependencies needs to be implemented.
> I'm not sure what that better way is, so I am leaving the assembly alone for
> now. Whoever picks up this Jira will have to remove it after deciding on a
> better method.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)