[ 
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260246#comment-14260246
 ] 

ASF GitHub Bot commented on MAHOUT-1636:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

    https://github.com/apache/mahout/pull/69#issuecomment-68277868
  
    I'll be out of town until the 31st and will test the NB drivers then. So 
far I've only tested them locally, and I haven't really been able to follow 
the conversation on this issue.
    
    
    
    -------- Original message --------
    From: Pat Ferrel <[email protected]>
    Date: 12/29/2014 12:00 PM (GMT-05:00)
    To: apache/mahout <[email protected]>
    Subject: [mahout] MAHOUT-1636 (#69)
    
    I started out simplifying the driver code and changing all drivers to 
support that. I then ran into the fat job.jar issue of MAHOUT-1636, so I 
created a slimmed-down version of the old job.jar by adding excludes to job.xml 
and changing the name to "dependencies.jar".
    
    The new jar works for spark-itemsimilarity and spark-row-similarity but 
needs to be tested for the naive bayes drivers.
    
    The dependencies.jar still contains a lot of stuff from mrlegacy. Some of 
it is in external projects, like jackson, which can be excluded with this 
mechanism, but there is also a lot of Mahout code that is unneeded in this jar. 
This latter case would require some mechanism other than a simple <exclude> 
clause in the assembly xml file.
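
    For illustration, the exclude mechanism described above might look roughly 
like the following Maven assembly descriptor. This is only a sketch, not the 
actual contents of spark/src/main/assembly/dependencies.xml: the id and the two 
jackson coordinates are assumptions chosen as examples of external artifacts.

```xml
<!-- Illustrative sketch only; NOT the real dependencies.xml from the pull request. -->
<assembly>
  <!-- Assumed id; it is appended to the artifact name of the generated jar. -->
  <id>dependencies</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <!-- Unpack every runtime dependency into a single jar. -->
      <unpack>true</unpack>
      <scope>runtime</scope>
      <excludes>
        <!-- Example external artifacts (assumed coordinates), given as groupId:artifactId. -->
        <exclude>org.codehaus.jackson:jackson-core-asl</exclude>
        <exclude>org.codehaus.jackson:jackson-mapper-asl</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>
```

    An <exclude> pattern of this kind matches whole artifacts, which is why it 
cannot drop individual unneeded Mahout classes that live inside an artifact 
the jar still needs.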
    
    I believe the new dependencies.jar is the only thing that needs to be on 
the classpath when running spark drivers or the spark-shell. I haven't changed 
this but it is a further refinement we can try.
    You can merge this Pull Request by running:
    
      git pull https://github.com/pferrel/mahout MAHOUT-1636
    
    Or you can view, comment on it, or merge it online at:
    
      https://github.com/apache/mahout/pull/69
    
    -- Commit Summary --
    
      * simplified driver and made required changes to all, note: left job 
assembly untouched
      * creating a trimmed down all-deps dependencies.jar for spark drivers
    
    -- File Changes --
    
        M math-scala/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala (2)
        M spark/pom.xml (13)
        R spark/src/main/assembly/dependencies.xml (22)
        M spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala (12)
        M spark/src/main/scala/org/apache/mahout/drivers/MahoutSparkDriver.scala (20)
        M spark/src/main/scala/org/apache/mahout/drivers/RowSimilarityDriver.scala (8)
        M spark/src/main/scala/org/apache/mahout/drivers/TestNBDriver.scala (64)
        M spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala (18)
    
    -- Patch Links --
    
    https://github.com/apache/mahout/pull/69.patch
    https://github.com/apache/mahout/pull/69.diff
    
    ---
    Reply to this email directly or view it on GitHub:
    https://github.com/apache/mahout/pull/69


> Class dependencies for the spark module are put in a job.jar, which is very 
> inefficient
> ---------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1636
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1636
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all 
> dependencies, including transitive ones. This job.jar is in 
> mahout/spark/target and is included in the classpath when a Spark job is run. 
> This allows dependency classes to be found at runtime, but the job.jar 
> includes a great deal that is not needed and that duplicates classes found in 
> the main mrlegacy job.jar. If the job.jar is removed, drivers will not find 
> needed classes. A better way of including class dependencies needs to be 
> implemented.
> I'm not sure what that better way is, so I am leaving the assembly alone for 
> now. Whoever picks up this Jira will have to remove it after deciding on a 
> better method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
