[
https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989285#comment-13989285
]
Sebastian Schelter commented on MAHOUT-1545:
--------------------------------------------
Good point. We should make sure we get this right in the new
preprocessing/vectorization code.
> Creating holdout sets with seq2sparse and split
> -----------------------------------------------
>
> Key: MAHOUT-1545
> URL: https://issues.apache.org/jira/browse/MAHOUT-1545
> Project: Mahout
> Issue Type: Bug
> Components: Classification, CLI, Examples
> Affects Versions: 0.9
> Reporter: Andrew Palumbo
>
> The current method of vectorizing data with seq2sparse and then running
> "split" lets a large amount of information leak between the training and test
> sets, especially with TF-IDF transformations. If the IDF transform is
> calculated before the data is split, it passes a lot of information about the
> holdout set to the training set.
> I'm not sure whether, given the current seq2sparse implementation's status as
> legacy and the relatively minor advantage it might give, it's worth adding
> something like a "split" option to SparseVectorsFromSequenceFiles.java. But I
> know that I saw a new implementation being discussed, and I think it would be
> worth having an option like this built in.
> I think that this issue may have been raised before, but I wanted to bring it
> up again in light of the current move away from MapReduce and the new
> implementations of Mahout tools that will be coming along.
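
For illustration only, here is a minimal sketch of the ordering the issue argues for: split the raw documents first, fit the IDF weights on the training portion alone, then apply those weights to both sets. It is plain Python, not Mahout code. The function names (split_docs, fit_idf, tfidf), the whitespace tokenization, and the smoothed IDF formula are all assumptions made for the example and do not describe what seq2sparse actually does.

```python
import math
import random
from collections import Counter


def split_docs(docs, test_fraction=0.25, seed=42):
    """Shuffle and split raw documents BEFORE any corpus statistics exist."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def fit_idf(train_docs):
    """Compute smoothed IDF weights from the training documents only."""
    n = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """Weight raw term counts by training-set IDF; unseen terms are dropped."""
    counts = Counter(doc.split())
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


docs = [
    "mahout builds scalable learners",
    "spark replaces mapreduce jobs",
    "tfidf weights rare terms",
    "holdout sets estimate error",
    "split the data before weighting",
    "idf depends on the corpus",
    "training data fits the model",
    "test data checks the model",
]

train, test = split_docs(docs)
idf = fit_idf(train)                        # the holdout set is never seen here
train_vectors = [tfidf(d, idf) for d in train]
test_vectors = [tfidf(d, idf) for d in test]
```

Because fit_idf only ever receives the training documents, document frequencies from the holdout set cannot influence the weights, which is the leak described above. Dropping test-set terms that never occur in training is one possible design choice; mapping them to a default weight is another.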