[ 
https://issues.apache.org/jira/browse/MAHOUT-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-451:
-------------------------------

    Attachment: MAHOUT-451.patch

Thanks for the feedback, Robin. Here's an updated patch that adds several ways 
of selecting a test set and fixes the issue with OptionException not being 
printed. The class is now instance-based instead of being a collection of 
static methods.

The composition of the test set is determined using one of the following 
approaches (setters also exposed via command-line arguments):

A contiguous set of items can be chosen from the input file(s) using the 
setTestSplitSize(int) or setTestSplitPct(int) methods. setTestSplitSize(int) 
allocates a fixed number of items, while setTestSplitPct(int) allocates a 
percentage of the original input, rounded up to the nearest integer. 
setSplitLocation(int) is used to control the position in the input from which 
the test data is extracted and is described in command-line help and the 
javadoc.
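
To make the contiguous case concrete, here is a minimal sketch of the idea. It is 
not the code from the patch: the class name, the method names, and the 
startIndex parameter (a stand-in for whatever setSplitLocation(int) actually 
means) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the code from the attached patch.
class ContiguousSplitSketch {

    // Test-set size for a percentage of the input, rounded up to the nearest integer.
    static int sizeFromPct(int inputSize, int pct) {
        return (int) (((long) inputSize * pct + 99) / 100);
    }

    // Returns [training, test]. startIndex is a hypothetical stand-in for the
    // split location; it is clamped so the test block always fits in the input.
    static List<List<String>> split(List<String> items, int testSize, int startIndex) {
        int size = Math.min(Math.max(testSize, 0), items.size());
        int start = Math.min(Math.max(startIndex, 0), items.size() - size);

        // The test set is one contiguous block of the input.
        List<String> test = new ArrayList<>(items.subList(start, start + size));

        // The training set is everything before and after that block.
        List<String> training = new ArrayList<>(items.subList(0, start));
        training.addAll(items.subList(start + size, items.size()));

        List<List<String>> result = new ArrayList<>();
        result.add(training);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("d0", "d1", "d2", "d3", "d4", "d5", "d6");
        int testSize = sizeFromPct(docs.size(), 25); // 25% of 7 is 1.75, rounded up to 2
        List<List<String>> parts = split(docs, testSize, 3);
        System.out.println("training = " + parts.get(0));
        System.out.println("test     = " + parts.get(1));
    }
}
```

The size is rounded up with integer arithmetic, so any non-zero percentage of a 
non-empty input yields at least one test item.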

A random sampling of items can be chosen from the input file(s) using the 
setTestRandomSelectSize(int) or setTestRandomSelectionPct(int) methods, which 
choose a fixed test set size or a percentage of the input set size, 
respectively, as described above. The RandomSampler class from mahout-math is 
used to create a sample of the appropriate size.
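
For readers unfamiliar with fixed-size sampling, the sketch below shows one way 
to pick exactly the requested number of test items in a single pass. It uses 
plain selection sampling with java.util.Random rather than the RandomSampler 
class, so it illustrates the general technique and not the patch itself; the 
class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; uses java.util.Random, not mahout-math's RandomSampler.
class RandomSelectionSketch {

    // Marks exactly testSize of inputSize positions as test items (selection sampling).
    static boolean[] chooseTestItems(int inputSize, int testSize, Random rng) {
        boolean[] isTest = new boolean[inputSize];
        int needed = Math.min(Math.max(testSize, 0), inputSize);
        for (int i = 0; i < inputSize && needed > 0; i++) {
            int remaining = inputSize - i;
            // Select position i with probability needed / remaining.
            if (rng.nextInt(remaining) < needed) {
                isTest[i] = true;
                needed--;
            }
        }
        return isTest;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("d0", "d1", "d2", "d3", "d4", "d5", "d6");
        boolean[] isTest = chooseTestItems(docs.size(), 2, new Random(42L));

        List<String> training = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            (isTest[i] ? test : training).add(docs.get(i));
        }
        System.out.println("training = " + training);
        System.out.println("test     = " + test);
    }
}
```

Because the selection probability is recomputed at each position, the loop 
always ends with exactly the requested number of items selected, and the 
selected items keep their original input order.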




> Simple utility to split bayes input into training/test sets
> -----------------------------------------------------------
>
>                 Key: MAHOUT-451
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-451
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: MAHOUT-451.patch, MAHOUT-451.patch
>
>
> Provides a simple utility that you point at a directory containing files in 
> Bayes classifier input format. Given the number of documents to write to the 
> test set, for each input file it will produce files in two output 
> directories, one containing training data with the test documents removed and 
> a second containing the test documents. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
