[ 
https://issues.apache.org/jira/browse/MAHOUT-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732435#action_12732435
 ] 

Robin Anil commented on MAHOUT-146:
-----------------------------------

It's already generic to some extent. Check out MAHOUT-60.

The usage for creating a dataset for BayesClassifier was like this:
{noformat}
hadoop jar build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.WikipediaDatasetCreator \
  -i wikipediadump -o wikipediainput -c pathto/country.txt
{noformat}

-c is the file with the list of categories (Wikipedia categories). So you could 
specify anything there, but it has to be a Wikipedia category. When I go 
through the XML dump, in the Map stage, for every article I match the list of 
categories against the categories the document is in and output it if a match occurs.

For example, you could create 2 categories by adding the following lines to the 
categories file:
Scientists of 1900
Scientists of 2000
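
To make the Map-stage matching concrete, here is a rough sketch in Python. It is not the actual Mahout mapper (which is Java); the function names, the `[[Category:...]]` regex, and the case-insensitive exact match are all assumptions made for illustration.

```python
import re

# Assumption: categories appear in the article wikitext as
# [[Category:Name]] or [[Category:Name|sort key]] links.
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]", re.IGNORECASE)


def load_categories(lines):
    """Return the wanted categories, one per non-empty line, lowercased."""
    return {line.strip().lower() for line in lines if line.strip()}


def article_categories(wikitext):
    """Return the lowercased categories that an article declares."""
    return {m.group(1).strip().lower() for m in CATEGORY_LINK.finditer(wikitext)}


def map_article(wikitext, wanted):
    """Yield (category, article text) for each wanted category the article is in."""
    for category in sorted(article_categories(wikitext) & wanted):
        yield category, wikitext


# Hypothetical categories file content and a made-up article.
wanted = load_categories(["Scientists of 1900", "Scientists of 2000", ""])
article = "Some biography text. [[Category:Scientists of 1900]] [[Category:Physicists|Doe]]"
for label, text in map_article(article, wanted):
    print(label)
```

An article that is in none of the listed categories yields nothing, so it is simply left out of the generated dataset.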

> Make Wikipedia Example Classifier more generic
> ----------------------------------------------
>
>                 Key: MAHOUT-146
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-146
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.2
>
>
> It would be nice if the Wikipedia classifier example was a bit more generic 
> instead of taking just countries.  For example, one could classify based on 
> other types of categories, such as things like "subjects", i.e. History, 
> Math, Science or other things.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
