[ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012960#comment-13012960 ]

Julien Nioche commented on MAHOUT-621:
--------------------------------------

From https://issues.apache.org/jira/browse/MAHOUT-368

{quote} > Why not have a bundle artifact where all the Mahout submodules 
would be put in a single jar? 

 How is this not trivial for you to handle with maven?
 If you are writing your own maven project (recommended), then 
jar-with-dependencies will do what you want.
 If you are extending Mahout (ok for prototypes), just put your code in the 
examples job jar and all will be good.
{quote}

I am not extending Mahout, and as you've probably seen in the comments above, the 
point is to be able to generate Mahout data structures from Behemoth, so putting 
the code in the examples module is not an option anyway.

Back to the original problem. I generate a job file for my Mahout module in 
Behemoth (https://github.com/jnioche/behemoth/tree/master/modules/mahout) and 
manage the dependencies with Ivy. The main class (SparseVectorsFromBehemoth) is 
a slightly modified version of SparseVectorsFromSequenceFiles which gets the 
Tokens from Behemoth documents instead of using Lucene and generates the data 
structures expected by the classifiers and clusterers.

The job file contains:
 * the Behemoth classes for the Mahout module
 * the dependencies in /lib including
  ** mahout-math-0.4.jar
  ** mahout-core-0.4.jar

The problem I had was the same as Han Hui Wen's (MAHOUT-368), i.e. I was getting a 
ClassNotFoundException on org.apache.mahout.math.VectorWritable. My 
understanding of the problem is that my main class calls DictionaryVectorizer, 
which in my job file lives in lib/mahout-core-0.4.jar and which depends 
on VectorWritable in lib/mahout-math-0.4.jar. For some reason 
MapReduce was not able to find VectorWritable, which I assume has to do with 
the jobs in DictionaryVectorizer calling 
'job.setJarByClass(DictionaryVectorizer.class)'.
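The symptom itself is easy to reproduce outside Hadoop: any JVM whose classpath does not include mahout-math will fail the same lookup. A minimal sketch (the class name is the one from the report; nothing here is Hadoop-specific, so it only demonstrates the symptom, not the Hadoop-side classloader behaviour):

```java
public class LibJarDemo {
    // Returns true if the named class is visible to this JVM's classpath.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Without mahout-math-0.4.jar on the classpath this lookup fails,
        // which is the same ClassNotFoundException the task JVMs reported.
        String cls = "org.apache.mahout.math.VectorWritable";
        System.out.println(cls + " loadable: " + isLoadable(cls));
    }
}
```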

I could of course run jar-with-dependencies on the Mahout code to generate a 
single jar and then manage that jar locally. However, this gives me very 
little control over the dependencies used by Mahout (e.g. potentially 
conflicting versions with other components in my job files), and I'd rather rely 
on externally published jars anyway. A better option would be to simply unpack 
the content of the mahout-core and mahout-math jars into the root of my job file. At 
least the Mahout dependencies would then be handled and versioned normally. 
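That repacking step needs nothing more than java.util.zip. The sketch below is illustrative only: it builds two dummy stand-in jars (the entry names are made up to mirror the report, not the real Mahout jar contents) and copies their entries into the root of a single job jar:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.*;

public class RepackDemo {
    // Write a tiny jar containing the given (empty) entries.
    static void writeJar(File jar, String... entries) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(jar))) {
            for (String name : entries) {
                out.putNextEntry(new ZipEntry(name));
                out.closeEntry();
            }
        }
    }

    // Copy every entry of each input jar into the root of the output jar,
    // so classes end up directly on the job jar's classpath instead of under lib/.
    static List<String> mergeIntoRoot(File jobJar, File... jars) throws IOException {
        List<String> merged = new ArrayList<>();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(jobJar))) {
            for (File jar : jars) {
                try (ZipInputStream in = new ZipInputStream(new FileInputStream(jar))) {
                    for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                        out.putNextEntry(new ZipEntry(e.getName()));
                        in.transferTo(out); // copies this entry's bytes
                        out.closeEntry();
                        merged.add(e.getName());
                    }
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Dummy stand-ins for mahout-core-0.4.jar and mahout-math-0.4.jar.
        File core = new File("core-stub.jar");
        File math = new File("math-stub.jar");
        writeJar(core, "org/apache/mahout/FakeCoreClass.class");
        writeJar(math, "org/apache/mahout/math/FakeVectorWritable.class");

        for (String name : mergeIntoRoot(new File("job.jar"), core, math)) {
            System.out.println(name);
        }
    }
}
```

In a real build the same merge would run over the actual mahout-core and mahout-math artifacts resolved by Ivy, keeping their versions managed normally.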

I've tried with Hadoop 0.21.0 and did not get this issue, so I suppose that 
something must have changed in the way the classloader handles dependencies 
within a job file. 

Makes sense?


> Support more data import mechanisms
> -----------------------------------
>
>                 Key: MAHOUT-621
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-621
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>              Labels: gsoc2011, mahout-gsoc-11
>
> We should have more ways of getting data in:
> 1. ARFF (MAHOUT-155)
> 2. CSV (MAHOUT-548)
> 3. Databases
> 4. Behemoth (Tika, Map-Reduce)
> 5. Other

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
