[
https://issues.apache.org/jira/browse/MAHOUT-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Drew Farris updated MAHOUT-451:
-------------------------------
Attachment: MAHOUT-451.patch
Thanks for the feedback, Robin. Here's an updated patch that supports several
different ways of selecting a test set and fixes the issue with OptionException
not being printed. The class is now instance-based instead of being a
collection of static methods.
The composition of the test set is determined using one of the following
approaches (setters also exposed via command-line arguments):
A contiguous set of items can be chosen from the input file(s) using the
setTestSplitSize(int) or setTestSplitPct(int) methods. setTestSplitSize(int)
allocates a fixed number of items, while setTestSplitPct(int) allocates a
percentage of the original input, rounded up to the nearest integer.
setSplitLocation(int) controls the position in the input from which the test
data is extracted; it is described in the command-line help and the javadoc.
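To illustrate the contiguous approach, here is a minimal standalone sketch. It
is not the code from the patch: the class name, the helper names and the
meaning of the start parameter (a zero-based item index) are assumptions made
for this example only. The real semantics of setSplitLocation(int) are the ones
given in the javadoc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a contiguous train/test split; not the patch's code. */
class ContiguousSplitSketch {

  /** Test set size for a percentage of the input, rounded up to the nearest integer. */
  static int testSizeFromPct(int inputSize, int pct) {
    if (inputSize < 0 || pct < 0 || pct > 100) {
      throw new IllegalArgumentException("inputSize must be >= 0 and pct in [0, 100]");
    }
    // Integer ceiling division avoids floating-point rounding surprises.
    return (int) (((long) inputSize * pct + 99) / 100);
  }

  /**
   * Removes testSize consecutive items beginning at start (an assumed
   * zero-based index, clamped so the block fits inside the input).
   * Returns [training, test].
   */
  static List<List<String>> split(List<String> items, int testSize, int start) {
    if (testSize < 0 || testSize > items.size()) {
      throw new IllegalArgumentException("testSize must be in [0, items.size()]");
    }
    int begin = Math.max(0, Math.min(start, items.size() - testSize));
    int end = begin + testSize;

    List<String> test = new ArrayList<>(items.subList(begin, end));
    List<String> training = new ArrayList<>(items.subList(0, begin));
    training.addAll(items.subList(end, items.size()));
    return Arrays.asList(training, test);
  }

  public static void main(String[] args) {
    List<String> items = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
    int testSize = testSizeFromPct(items.size(), 25); // 25% of 10 items, rounded up: 3
    List<List<String>> parts = split(items, testSize, 2);
    System.out.println("training = " + parts.get(0)); // [a, b, f, g, h, i, j]
    System.out.println("test     = " + parts.get(1)); // [c, d, e]
  }
}
```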
A random sampling of items can be chosen from the input file(s) using the
setTestRandomSelectSize(int) or setTestRandomSelectionPct(int) methods, which
choose a fixed test set size or a percentage of the input set size,
respectively, as described above. The RandomSampler class from mahout-math is
used to create a sample of the appropriate size.
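The idea behind the random selection can be sketched as follows. To keep the
example self-contained it does not call RandomSampler. Instead it uses plain
selection sampling with java.util.Random as a stand-in, so the class and method
names here are hypothetical and the patch itself may behave differently in
detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a random train/test split; a stand-in for RandomSampler. */
class RandomSelectionSketch {

  /**
   * Picks exactly sampleSize distinct indices from [0, inputSize) in
   * increasing order using selection sampling: item i is taken with
   * probability (still needed) / (still remaining).
   */
  static int[] sampleIndices(int sampleSize, int inputSize, Random rng) {
    if (sampleSize < 0 || sampleSize > inputSize) {
      throw new IllegalArgumentException("sampleSize must be in [0, inputSize]");
    }
    int[] chosen = new int[sampleSize];
    int selected = 0;
    for (int i = 0; i < inputSize && selected < sampleSize; i++) {
      int remainingItems = inputSize - i;
      int remainingNeeded = sampleSize - selected;
      if (rng.nextInt(remainingItems) < remainingNeeded) {
        chosen[selected++] = i;
      }
    }
    return chosen;
  }

  /** Moves the sampled items into the test list; returns [training, test]. */
  static List<List<String>> split(List<String> items, int testSize, Random rng) {
    int[] testIdx = sampleIndices(testSize, items.size(), rng);
    List<String> training = new ArrayList<>();
    List<String> test = new ArrayList<>();
    int next = 0; // position of the next sampled index to match
    for (int i = 0; i < items.size(); i++) {
      if (next < testIdx.length && testIdx[next] == i) {
        test.add(items.get(i));
        next++;
      } else {
        training.add(items.get(i));
      }
    }
    List<List<String>> result = new ArrayList<>();
    result.add(training);
    result.add(test);
    return result;
  }

  public static void main(String[] args) {
    List<String> items = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      items.add("doc" + i);
    }
    List<List<String>> parts = split(items, 3, new Random(42L));
    System.out.println("training = " + parts.get(0));
    System.out.println("test     = " + parts.get(1));
  }
}
```

Because the indices come out in increasing order, both output lists keep the
items in their original input order.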
> Simple utility to split bayes input into training/test sets
> -----------------------------------------------------------
>
> Key: MAHOUT-451
> URL: https://issues.apache.org/jira/browse/MAHOUT-451
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Affects Versions: 0.3
> Reporter: Drew Farris
> Priority: Minor
> Attachments: MAHOUT-451.patch, MAHOUT-451.patch
>
>
> Provides a simple utility that you point at a directory containing files in
> Bayes classifier input format. Given the number of documents to write to the
> test set, for each input file it will produce files in two output
> directories, one containing training data with the test documents removed and
> a second containing the test documents.