[jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient

Pat Ferrel (JIRA) Mon, 29 Dec 2014 09:01:34 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260222#comment-14260222
 ]


Pat Ferrel commented on MAHOUT-1636:
------------------------------------

The "front-end" gets the classpath created in the mahout script. 

The "back-end" gets the classpath created in mahoutSparkContext, which uses 
"mahout classpath -spark" and allows for a list of special-purpose jars (not 
used internally by Mahout).

The outputs of "mahout classpath" and "mahout classpath -spark" are identical in 
my case. For the back-end, mahoutSparkContext has the chance to modify the 
classpath or add jars but does not.
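
One way to confirm that the two classpaths really match, or to see what 
differs once they diverge, is to compare them entry by entry. The following is 
only a minimal sketch: the helper names are hypothetical, it assumes the 
"mahout" script is on the PATH, and it assumes the classpath is printed as a 
single separator-delimited string on stdout, which may not hold for every 
version of the script.

```python
import os
import subprocess


def classpath_entries(cp, sep=os.pathsep):
    """Split a classpath string into a list of non-empty, trimmed entries."""
    return [e.strip() for e in cp.strip().split(sep) if e.strip()]


def diff_classpaths(front, back, sep=os.pathsep):
    """Return (only_in_front, only_in_back) as sorted lists of entries."""
    f = set(classpath_entries(front, sep))
    b = set(classpath_entries(back, sep))
    return sorted(f - b), sorted(b - f)


def mahout_classpath(*args):
    """Hypothetical helper: run "mahout classpath [args]" and return stdout.

    Assumes the script prints only the classpath; adjust if it logs more.
    """
    result = subprocess.run(
        ["mahout", "classpath", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling diff_classpaths(mahout_classpath(), mahout_classpath("-spark")) would 
return two empty lists when the front-end and back-end classpaths are 
identical, as they are in my case.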

So there may be some refactoring to do here. For instance, why are the two 
classpaths identical? Surely there are more things we can exclude when running 
Hadoop drivers. Also, should we be using something like spark-submit to launch 
Spark drivers?

As a first step, I'll try creating a new "dependencies.jar" assembly which has 
all transitive dependencies for Spark drivers, excluding anything that seems 
unneeded or is already guaranteed by the environment. I believe that the only 
way to test this will be to run all drivers from the CLI, since scalatest 
during build uses a different method for finding classes. See 
https://github.com/apache/mahout/pull/69 for further discussion.
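
To help decide what can be excluded from such an assembly, it may be useful to 
list the classes that two jars have in common. This is a rough sketch, not part 
of the Mahout build: the function names are hypothetical, and it only compares 
entry names ending in ".class", so it says nothing about whether the duplicated 
classes are the same version.

```python
import zipfile


def class_entries(jar_path):
    """Return the set of .class entry names contained in a jar file."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def duplicated_classes(jar_a, jar_b):
    """Return a sorted list of .class entries present in both jars."""
    return sorted(class_entries(jar_a) & class_entries(jar_b))
```

Running duplicated_classes on the spark job.jar and the mrlegacy job.jar would 
give a list of candidates for exclusion; the paths to those jars depend on the 
local build and are not shown here.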

[~tdunning] I assume this doesn't duplicate anything you are doing on this 
ticket?

> Class dependencies for the spark module are put in a job.jar, which is very 
> inefficient
> ---------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1636
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1636
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all 
> dependencies including transitive ones. This job.jar is in 
> mahout/spark/target and is included in the classpath when a Spark job is run. 
> This allows dependency classes to be found at runtime, but the job.jar 
> includes a great many unneeded classes that duplicate those found in the 
> main mrlegacy job.jar. If the job.jar is removed, drivers will not find 
> needed classes. A better way needs to be implemented for including class 
> dependencies.
> I'm not sure what that better way is, so I am leaving the assembly alone for 
> now. Whoever picks up this Jira will have to remove it after deciding on a 
> better method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
