[ https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260222#comment-14260222 ]
Pat Ferrel commented on MAHOUT-1636:
------------------------------------
The "front-end" gets the classpath created in the mahout script.
The "back-end" gets the classpath created in mahoutSparkContext, which uses
"mahout classpath -spark" and allows for a list of special-purpose jars (not
used internally by Mahout).
The outputs of "mahout classpath" and "mahout classpath -spark" are identical in
my case. For the back-end, mahoutSparkContext has the chance to modify the
classpath or add jars but does not.
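To make the comparison above concrete, here is a minimal Python sketch that diffs two classpath strings entry by entry. The function names are hypothetical and not part of Mahout; the inputs are assumed to be the raw text printed by "mahout classpath" and "mahout classpath -spark".

```python
import os


def split_classpath(cp, sep=os.pathsep):
    """Split a classpath string into its non-empty entries, preserving order."""
    return [entry for entry in cp.strip().split(sep) if entry]


def diff_classpaths(front, back, sep=os.pathsep):
    """Return (only_in_front, only_in_back, shared) as ordered lists.

    `front` and `back` are assumed to be classpath strings, e.g. the
    captured output of the two mahout commands.
    """
    front_entries = split_classpath(front, sep)
    back_entries = split_classpath(back, sep)
    front_set, back_set = set(front_entries), set(back_entries)
    only_front = [e for e in front_entries if e not in back_set]
    only_back = [e for e in back_entries if e not in front_set]
    shared = [e for e in front_entries if e in back_set]
    return only_front, only_back, shared
```

If the first two lists come back empty, the two classpaths contain the same entries, which is the situation described above.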
So there may be some refactoring to do here. For instance, why are the two
classpaths identical? Surely there are more things we can exclude when running
Hadoop drivers. Also, should we be using something like spark-submit to launch
Spark drivers?
As a first step I'll try creating a new "dependencies.jar" assembly which has
all transitive dependencies for Spark drivers, excluding anything that seems
unneeded or is already guaranteed by the environment. I believe that the only
way to test this will be to run all drivers from the CLI, since scalatest
during build uses a different method for finding classes. See
https://github.com/apache/mahout/pull/69 for further discussion.
[~tdunning] I assume this doesn't duplicate anything you are doing on this
ticket?
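The selection step for the proposed "dependencies.jar" could be sketched roughly as below, in Python. This is only an illustration of the filtering idea, not the Maven assembly itself: the prefix list is a hypothetical stand-in for "already guaranteed by the environment", and the function names are made up for this example.

```python
import os

# Hypothetical prefixes for jars assumed to be supplied by the cluster
# environment; the real list would have to be worked out per deployment.
PROVIDED_PREFIXES = ("spark-", "hadoop-", "scala-library")


def is_provided(path, prefixes=PROVIDED_PREFIXES):
    """True if the jar's file name starts with one of the provided prefixes."""
    return os.path.basename(path).startswith(tuple(prefixes))


def select_for_assembly(jars, prefixes=PROVIDED_PREFIXES):
    """Keep jars that are not provided, dropping repeats of the same file name."""
    kept, seen = [], set()
    for jar in jars:
        name = os.path.basename(jar)
        if is_provided(jar, prefixes) or name in seen:
            continue
        seen.add(name)
        kept.append(jar)
    return kept
```

Matching on file-name prefixes is a deliberately crude choice; an actual assembly descriptor would more likely exclude by groupId and artifactId.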
> Class dependencies for the spark module are put in a job.jar, which is very
> inefficient
> ---------------------------------------------------------------------------------------
>
> Key: MAHOUT-1636
> URL: https://issues.apache.org/jira/browse/MAHOUT-1636
> Project: Mahout
> Issue Type: Bug
> Components: spark
> Affects Versions: 1.0-snapshot
> Reporter: Pat Ferrel
> Assignee: Ted Dunning
> Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all
> dependencies, including transitive ones. This job.jar is in
> mahout/spark/target and is included in the classpath when a Spark job is run.
> This allows dependency classes to be found at runtime, but the job.jar
> includes a great many unneeded classes that duplicate those found in the
> main mrlegacy job.jar. If the job.jar is removed, drivers will not find
> needed classes. A better way needs to be implemented for including class
> dependencies.
> I'm not sure what that better way is, so I am leaving the assembly alone for
> now. Whoever picks up this Jira will have to remove it after deciding on a
> better method.
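One way to gauge the duplication described in the issue is to list the .class entries two jars have in common. The Python sketch below uses only the standard zipfile module, since a jar is a zip archive; the function names are hypothetical, and the jar paths would be whatever the local build produces.

```python
import zipfile


def class_entries(jar_path):
    """Return the set of .class entry names inside a jar (a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def duplicated_classes(jar_a, jar_b):
    """Return the sorted .class entries present in both jars."""
    return sorted(class_entries(jar_a) & class_entries(jar_b))
```

Calling duplicated_classes on the spark job.jar and the mrlegacy job.jar would give a rough count of how many classes are shipped twice.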