[jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient

ASF GitHub Bot (JIRA) Mon, 29 Dec 2014 12:35:32 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260421#comment-14260421
 ]


ASF GitHub Bot commented on MAHOUT-1636:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

    https://github.com/apache/mahout/pull/69#discussion_r22327470
  
    --- Diff: spark/src/main/assembly/dependencies.xml ---
    @@ -38,9 +38,34 @@
           <outputDirectory>/</outputDirectory>
           <useTransitiveFiltering>true</useTransitiveFiltering>
           <excludes>
    +        <!-- MAHOUT-1636 -->
    +        <!-- add any projects that are included in the spark environment or are in mrlegacy
    +        but not used in spark drivers -->
             <exclude>org.apache.hadoop:hadoop-core</exclude>
    +        <exclude>org.apache.spark:spark-core_${scala.major}</exclude>
    +        <exclude>org.scala-lang:scala-library</exclude>
    +        <exclude>jackson-core-asl</exclude>
    +        <exclude>jackson-mapper-asl</exclude>
    +        <exclude>xstream</exclude>
    +        <exclude>lucene-core</exclude>
    +        <exclude>lucene-analyzers-common</exclude>
           </excludes>
         </dependencySet>
    --- End diff --
    
    Sounds good. Push whenever you think it's ready; I don't think there will be 
any problems. The only difference I can think of from the basic 
structure of the item-similarity driver is that the NB driver makes calls to a 
SparkNB object from the spark module, which overrides the math-scala 
implementation and calls Spark's aggregateByKey(...). So as long as that's 
available and math-scala is available, there shouldn't be any issues. 
    
    
    
    
    -------- Original message --------
    From: Pat Ferrel <[email protected]>
    Date: 12/29/2014 3:03 PM (GMT-05:00)
    To: apache/mahout <[email protected]>
    Cc: Andrew Palumbo <[email protected]>
    Subject: Re: [mahout] MAHOUT-1636 (#69)

    In spark/src/main/assembly/dependencies.xml:
    
    >          <exclude>org.apache.hadoop:hadoop-core</exclude>
    > +        <exclude>org.apache.spark:spark-core_${scala.major}</exclude>
    > +        <exclude>org.scala-lang:scala-library</exclude>
    > +        <exclude>jackson-core-asl</exclude>
    > +        <exclude>jackson-mapper-asl</exclude>
    > +        <exclude>xstream</exclude>
    > +        <exclude>lucene-core</exclude>
    > +        <exclude>lucene-analyzers-common</exclude>
    >        </excludes>
    >      </dependencySet>
    This is as many as seem safe. There is a lot inside mrlegacy that could be 
excluded, but it's all in the same artifact, so I'm leaving it in unless someone 
knows how to exclude particular packages within an artifact.
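    
    One possible approach, sketched here as an assumption rather than something 
    tested against this build: the assembly descriptor format lets a dependencySet 
    unpack its artifacts and filter the unpacked files by path pattern under 
    unpackOptions. The artifact coordinate and the package path below are 
    hypothetical placeholders, and the filter only applies if the set is unpacked 
    into the jar.
    
    ```xml
    <dependencySet>
      <!-- Assumption: mrlegacy is published as org.apache.mahout:mahout-mrlegacy -->
      <includes>
        <include>org.apache.mahout:mahout-mrlegacy</include>
      </includes>
      <outputDirectory>/</outputDirectory>
      <!-- File-level filtering applies only to unpacked artifacts -->
      <unpack>true</unpack>
      <unpackOptions>
        <excludes>
          <!-- Hypothetical package path; replace with packages the Spark drivers never load -->
          <exclude>org/example/unused/**</exclude>
        </excludes>
      </unpackOptions>
    </dependencySet>
    ```
    
    Classes trimmed this way would only fail at runtime if they turn out to be 
    needed, so each excluded path would have to be checked against the drivers.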
    
    I won't change the code to trim things from the classpath in this commit, but 
I suspect the dependencies.jar may be all that is needed for spark-shell and 
drivers.
    
    @andrewpalumbo there's little chance this will mess up your drivers so I 
may push this after some more testing on my side.
    
    



> Class dependencies for the spark module are put in a job.jar, which is very 
> inefficient
> ---------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1636
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1636
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all 
> dependencies, including transitive ones. This job.jar is in 
> mahout/spark/target and is included in the classpath when a Spark job is run. 
> This allows dependency classes to be found at runtime, but the job.jar includes 
> a great deal that is not needed and duplicates classes found in the 
> main mrlegacy job.jar. If the job.jar is removed, drivers will not find 
> needed classes. A better way needs to be implemented for including class 
> dependencies.
> I'm not sure what that better way is, so I am leaving the assembly alone for 
> now. Whoever picks up this Jira will have to remove it after deciding on a 
> better method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
