[jira] [Commented] (MAHOUT-1545) Creating holdout sets with seq2sparse and split

Sebastian Schelter (JIRA) Sun, 04 May 2014 22:06:04 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989285#comment-13989285
 ]


Sebastian Schelter commented on MAHOUT-1545:
--------------------------------------------

Good point. We should make sure we get this right in the new 
preprocessing/vectorization code.

> Creating holdout sets with seq2sparse and split
> -----------------------------------------------
>
>                 Key: MAHOUT-1545
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1545
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification, CLI, Examples
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>
> The current method for vectorizing data using seq2sparse and then "split" 
> allows a large amount of information to spill over from the training sets 
> to the test sets, especially in the case of TF-IDF transformations.  The IDF 
> transform provides a lot of information about the holdout set to the training 
> set if it is calculated before splitting them up.  
> Given the current seq2sparse implementation's status as legacy and the 
> relatively minor advantage this might give, I'm not sure whether it's worth 
> adding something like a "split" option to 
> SparseVectorsFromSequenceFiles.java.  But I know that I saw a new 
> implementation being discussed, and I think it would be worth it to 
> have an option like this built in.    
> I think that this issue may have been raised before, but I wanted to bring it 
> up again in light of the current move away from MapReduce and the new 
> implementations of Mahout tools that will be coming along. 
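
To make the concern in the description concrete, here is a minimal sketch of
the split-before-IDF ordering in plain Python. It is not Mahout code: the
function names fit_idf and tfidf, the toy documents, and the smoothed IDF
formula are all assumptions chosen for illustration, and the weighting is not
necessarily the one seq2sparse applies. The only point is that document
frequencies are counted on the training documents alone and then reused,
unchanged, for the holdout set.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute smoothed IDF weights from the training documents only.

    The formula log((1 + n) / (1 + df)) + 1 is an illustrative assumption.
    """
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))  # count each term at most once per document
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf(doc, idf):
    """Weight raw term counts by a fitted IDF table.

    Terms that never occurred in the training documents are dropped.
    """
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


# Hypothetical tokenized corpus; the last document plays the holdout set.
docs = [
    ["mahout", "spark"],
    ["mahout", "hadoop"],
    ["spark", "scala"],
    ["mahout", "flink"],
]

# Split first, then fit the IDF on the training portion only.
train, test = docs[:3], docs[3:]
idf = fit_idf(train)
train_vecs = [tfidf(d, idf) for d in train]
test_vecs = [tfidf(d, idf) for d in test]

# For contrast: fitting on everything lets the holdout document change the weights.
leaky_idf = fit_idf(docs)
```

In this sketch leaky_idf differs from idf both in its vocabulary (it contains
"flink", which appears only in the holdout document) and in the weight it
assigns to "mahout", which is the kind of spill-over the description refers to.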



--
This message was sent by Atlassian JIRA
(v6.2#6252)
