[jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient

Dmitriy Lyubimov (JIRA) Wed, 24 Dec 2014 10:39:07 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258431#comment-14258431
 ]


Dmitriy Lyubimov commented on MAHOUT-1636:
------------------------------------------

bq. BTW I'd be surprised if the DSL+Mahout shell will run properly without the 
mrlegacy job jar. Try removing it and using something from mrlegacy with 
transitive external dependencies.

Hm. I thought legacy was excluded. Certainly the legacy job jar should be 
excluded (for one thing, if I understand it correctly, it is in the MR job 
format, not a shaded format, so the JVM cannot consume it). Let me try to clean 
that out. I am pretty sure the shell front end doesn't use any of that stuff.


Actually I am not so much concerned about the paths in the front end. They can 
sweepingly include whatever, as long as there are no incompatibilities. It is 
the backend path I am more worried about. It is currently carefully crafted to 
include the minimum subset of jars really used by the algebra and not already 
provided by Spark itself (I think there are only 4 or 5 of them). This is 
essential for quick session setup, as the jars are broadcast with every Spark 
session setup and also copied into temp dirs on the workers with every session, 
thus cluttering the disk space there. There certainly should be no job jars in 
the backend classpath if Mahout really wants to stay relevant to fast cluster 
computing.
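
The selection rule for the backend classpath described above (keep only the few jars the algebra needs, skip anything Spark already provides, and never ship a job jar) can be illustrated with a minimal sketch. This is not Mahout's actual code: the function names, the `REQUIRED_PREFIXES` list, and the example jar file names are all assumptions made for illustration, since the thread does not list the real 4 or 5 jars.

```python
from pathlib import PurePosixPath

# Hypothetical whitelist of jar name prefixes the backend needs.
# The real list is not given in this thread; these names are assumptions.
REQUIRED_PREFIXES = ("mahout-math", "mahout-spark")


def is_job_jar(name):
    """Return True for MR-style job jars, which should never reach the workers."""
    return name == "job.jar" or name.endswith("-job.jar")


def backend_jars(candidates):
    """Keep only whitelisted, non-job jars, preserving the input order."""
    selected = []
    for path in candidates:
        name = PurePosixPath(path).name
        # Skip non-jar files and job jars outright.
        if not name.endswith(".jar") or is_job_jar(name):
            continue
        # Anything not on the whitelist (e.g. jars Spark provides) is dropped.
        if name.startswith(REQUIRED_PREFIXES):
            selected.append(path)
    return selected


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    candidates = [
        "/opt/mahout/math/target/mahout-math-1.0-SNAPSHOT.jar",
        "/opt/mahout/spark/target/mahout-spark-1.0-SNAPSHOT.jar",
        "/opt/mahout/spark/target/mahout-spark-1.0-SNAPSHOT-job.jar",
        "/opt/mahout/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT-job.jar",
        "/opt/spark/lib/spark-assembly.jar",
    ]
    for jar in backend_jars(candidates):
        print(jar)
```

A list produced this way could then be handed to Spark as a comma-separated value of the `spark.jars` configuration property, so that only those few files are shipped to the workers with each session.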

> Class dependencies for the spark module are put in a job.jar, which is very 
> inefficient
> ---------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1636
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1636
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all 
> dependencies, including transitive ones. This job.jar is in 
> mahout/spark/target and is included in the classpath when a Spark job is run. 
> This allows dependency classes to be found at runtime, but the job.jar 
> includes a great deal of things that are not needed and are duplicates of 
> classes found in the main mrlegacy job.jar. If the job.jar is removed, drivers 
> will not find needed classes. A better way needs to be implemented for 
> including class dependencies.
> I'm not sure what that better way is, so I am leaving the assembly alone for 
> now. Whoever picks up this Jira will have to remove it after deciding on a 
> better method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
