[
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257378#comment-14257378
]
Dmitriy Lyubimov commented on MAHOUT-1636:
------------------------------------------
Presumably, if we have missing classes, we would encounter a
ClassNotFoundException. If the exception is thrown in a backend worker JVM,
that is a backend classpath problem; if it is thrown in the driver JVM, that
is a front-end classpath problem. The reason is that the two classpaths are
configured differently.
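As a rough triage sketch (this heuristic is my own assumption for illustration, not anything in Mahout or Spark): one crude way to guess which classpath a ClassNotFoundException points at is to look for Spark executor frames in its stack trace.

```java
// Crude triage heuristic (an assumption for illustration, not Mahout code):
// an exception whose stack trace includes Spark executor frames most likely
// surfaced in a backend JVM, i.e. a backend classpath problem.
class CnfeTriage {
    static boolean looksLikeBackendProblem(Throwable t) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (frame.getClassName().startsWith("org.apache.spark.executor")) {
                return true;
            }
        }
        return false; // no executor frames: likely a front-end (driver) issue
    }
}
```

One caveat: a backend failure is often re-thrown on the driver wrapped in a job-failure exception, so this is only a first approximation, not a definitive check.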
The backend classpath is formed by createMahoutContext, with an option to
throw in additional jars. The front-end classpath is managed mostly by the
starter script (such as the `mahout spark-shell` starter).
If the problem is (as I suspect) in the backend, then we should proceed on a
case-by-case basis.
If the problem is caused by algorithm closure code that the driver uses, it
is the driver's responsibility to find and add the extra jars to the backend,
unless the dependency is something overly common (like apache-math) and we
can agree to make it standard for all jobs.
If the problem is caused by execution of DRM code, then it is a bug, and the
standard Mahout context needs to be tweaked to include the missing
dependencies.
The surest way to test this is to run all unit tests against a non-local
master. (I thought I had a hack for that, but apparently it was on a
proprietary branch that I never committed back.)
Next, if it is the driver's closure code that causes this, and it is only
pertinent to this particular method, then the driver should check the
`mahout classpath` output to see whether the jar is there, and use its
location from that output (the same way the createMahoutContext call does).
If the jar is not found there, then we need to re-examine how `mahout
classpath` works, and perhaps ask why we need what it misses -- there is a
fairly good chance we don't need anything that `mahout classpath` doesn't
show, because we have never needed it before in the history of the project.
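To illustrate the check described above (a sketch under the assumption that `mahout classpath` prints a standard path-separator-delimited string; the jar names below are made up): a driver could scan that output for the jar it needs before shipping it to the backend.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Optional;

// Sketch: scan a classpath string (e.g. the output of `mahout classpath`)
// for a jar whose path contains the given name fragment.
class ClasspathScan {
    static Optional<String> findJar(String classpath, String nameFragment) {
        return Arrays.stream(classpath.split(File.pathSeparator))
                .filter(p -> p.endsWith(".jar") && p.contains(nameFragment))
                .findFirst();
    }
}
```

In a real driver the classpath string would come from invoking the `mahout classpath` starter script; the located path could then be handed to whatever mechanism ships extra jars to the backend.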
> Class dependencies for the spark module are put in a job.jar, which is very
> inefficient
> ---------------------------------------------------------------------------------------
>
> Key: MAHOUT-1636
> URL: https://issues.apache.org/jira/browse/MAHOUT-1636
> Project: Mahout
> Issue Type: Bug
> Components: spark
> Affects Versions: 1.0-snapshot
> Reporter: Pat Ferrel
> Fix For: 1.0-snapshot
>
>
> Using a maven plugin and an assembly job.xml, a job.jar is created with all
> dependencies, including transitive ones. This job.jar is in
> mahout/spark/target and is included in the classpath when a Spark job is run.
> This allows dependency classes to be found at runtime, but the job.jar
> includes a great many things that are not needed and that duplicate classes
> found in the main mrlegacy job.jar. If the job.jar is removed, drivers will
> not find needed classes. A better way of including class dependencies needs
> to be implemented.
> I'm not sure what that better way is, so I am leaving the assembly alone for
> now. Whoever picks up this Jira will have to remove it after deciding on a
> better method.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)