[ 
https://issues.apache.org/jira/browse/MAHOUT-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-451:
-------------------------------

    Attachment: MAHOUT-451.patch

Basic implementation of training/test split utility, runnable via:

{code}
./bin/mahout -core org.apache.mahout.classifier.bayes.SplitBayesInput \
  -i ~/mahout/bayes/20news-input \
  -tr ~/mahout/bayes/20news-train-x \
  -te ~/mahout/bayes/20news-test-x \
  -c UTF-8 -s 100
{code}

This writes the last 100 documents in each input file to the test output 
directory (-te) and the rest to the training directory (-tr).
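
For illustration, the "last N documents go to test" behaviour can be sketched roughly as follows. This is not the code from the attached patch: the class and method names (SplitSketch, splitLastN, splitFile) are hypothetical, and the sketch assumes one document per line in each input file and clamps the test size when a file holds fewer documents than requested.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a tail-based training/test split; not the patch itself. */
public class SplitSketch {

  /** Holds the two halves of a split. */
  public static final class Split {
    public final List<String> training;
    public final List<String> test;

    Split(List<String> training, List<String> test) {
      this.training = training;
      this.test = test;
    }
  }

  /**
   * Puts the last testSize documents into the test set and the rest into
   * the training set. Assumption: if the input holds fewer than testSize
   * documents, all of them go to the test set.
   */
  public static Split splitLastN(List<String> docs, int testSize) {
    int n = Math.max(0, Math.min(testSize, docs.size()));
    int cut = docs.size() - n;
    return new Split(
        new ArrayList<>(docs.subList(0, cut)),
        new ArrayList<>(docs.subList(cut, docs.size())));
  }

  /**
   * Splits one input file (assumed: one document per line) and writes a
   * file with the same name into each output directory.
   */
  public static void splitFile(Path input, Path trainDir, Path testDir,
                               Charset charset, int testSize) throws IOException {
    List<String> docs = Files.readAllLines(input, charset);
    Split split = splitLastN(docs, testSize);
    Files.createDirectories(trainDir);
    Files.createDirectories(testDir);
    Files.write(trainDir.resolve(input.getFileName()), split.training, charset);
    Files.write(testDir.resolve(input.getFileName()), split.test, charset);
  }
}
```

Calling splitFile once per file in the input directory with testSize set to 100 would mirror the -s 100 invocation above, under the same assumptions.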

It should probably be updated to use Hadoop FS primitives instead of 
java.io.File, etc.

> Simple utility to split bayes input into training/test sets
> -----------------------------------------------------------
>
>                 Key: MAHOUT-451
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-451
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: MAHOUT-451.patch
>
>
> Provides a simple utility that you point at a directory containing files in 
> Bayes classifier input format. Given the number of documents to write to the 
> test set, for each input file it will produce files in two output 
> directories, one containing training data with the test documents removed and 
> a second containing the test documents. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
