[
https://issues.apache.org/jira/browse/MAHOUT-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Drew Farris updated MAHOUT-451:
-------------------------------
Attachment: MAHOUT-451.patch
Basic implementation of training/test split utility, runnable via:
{code}
./bin/mahout -core org.apache.mahout.classifier.bayes.SplitBayesInput \
  -i ~/mahout/bayes/20news-input \
  -tr ~/mahout/bayes/20news-train-x \
  -te ~/mahout/bayes/20news-test-x \
  -c UTF-8 -s 100
{code}
This writes the last 100 documents in each input file to the test output
directory (-te) and the rest to the training directory (-tr).
It should probably be updated to use Hadoop fs primitives instead of
java.io.File, etc.
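For illustration only, the per-file split described above could look roughly like the sketch below. This is not the code in the attached patch: the class name SplitSketch and the method splitFile are hypothetical, and it assumes each line of an input file is one document. It also reads the whole file into memory, which is fine for a sketch but may not suit large inputs.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch, not the SplitBayesInput class from the patch.
public class SplitSketch {

    /**
     * Writes the last testSize lines of input to testOut and all
     * preceding lines to trainOut.
     * Assumption: one document per line in the input file.
     */
    static void splitFile(Path input, Path trainOut, Path testOut,
                          int testSize, Charset charset) throws IOException {
        List<String> docs = Files.readAllLines(input, charset);
        // Clamp at zero: a file with fewer than testSize documents
        // goes entirely to the test set and leaves training empty.
        int splitAt = Math.max(0, docs.size() - testSize);
        Files.write(trainOut, docs.subList(0, splitAt), charset);
        Files.write(testOut, docs.subList(splitAt, docs.size()), charset);
    }
}
```

Calling splitFile once per file in the input directory, with testSize set to the -s value and charset to the -c value, would mirror the behaviour of the command above under these assumptions.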
> Simple utility to split bayes input into training/test sets
> -----------------------------------------------------------
>
> Key: MAHOUT-451
> URL: https://issues.apache.org/jira/browse/MAHOUT-451
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Affects Versions: 0.3
> Reporter: Drew Farris
> Priority: Minor
> Attachments: MAHOUT-451.patch
>
>
> Provides a simple utility that you point at a directory containing files in
> Bayes classifier input format. Given the number of documents to write to the
> test set, for each input file it will produce files in two output
> directories, one containing training data with the test documents removed and
> a second containing the test documents.