[ 
https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1545:
-----------------------------------

    Description: 
The current method for vectorizing data using seq2sparse and then "split" 
allows a large amount of information to spill over from the training sets 
to the test sets, especially in the case of TF-IDF transformations.  The IDF 
transform gives the training set a lot of information about the holdout set 
if it is calculated before the data is split.  
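
To illustrate the ordering being asked for, here is a minimal sketch in plain 
Python rather than Mahout code.  The function names (split_docs, fit_idf, 
tfidf), the toy documents, and the smoothed IDF formula are assumptions made 
for illustration only; they are not the seq2sparse implementation or its 
weighting.  The point is simply that the raw documents are split first and 
the IDF table is learned from the training portion alone.

```python
import math
import random
from collections import Counter


def split_docs(docs, test_fraction=0.25, seed=42):
    """Shuffle and split raw documents BEFORE any corpus statistics exist."""
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_idf(docs):
    """Learn IDF weights from the given documents only.

    Uses a smoothed formula, log((1 + N) / (1 + df)) + 1, chosen for this
    sketch; it is not claimed to match Mahout's weighting.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf(doc, idf):
    """Weight one document with a previously fitted IDF table.

    Terms that were not seen when the table was fitted are dropped.
    """
    tf = Counter(doc.split())
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


# Hypothetical toy corpus, whitespace-tokenized.
docs = [
    "mahout vectorizes text",
    "holdout sets test the model",
    "idf weights depend on the corpus",
    "split the corpus before weighting",
]

train, test = split_docs(docs)
idf = fit_idf(train)                          # holdout documents never seen here
train_vecs = [tfidf(d, idf) for d in train]
test_vecs = [tfidf(d, idf) for d in test]     # reuses the training-only table
```

Fitting on the training documents alone means that document frequencies, and 
therefore the weights applied to the holdout set, carry no information from 
the holdout documents themselves.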

Given the current seq2sparse implementation's status as Legacy and the 
relatively minor advantage that it might give, I'm not sure whether it's worth 
adding something like a "split" option to SparseVectorsFromSequenceFiles.java.  
But I know that I saw a new implementation being discussed, and I think that 
it would be worth having an option like this built in.    

I think that this issue may have been raised before, but I wanted to bring it 
up again in light of the current move away from MapReduce and the new 
implementations of Mahout tools that will be coming along. 



  was:
The current method for vectorizing data using seq2sparse and then "split" 
allows for a large amount of information to spill over from the training sets 
to the test sets- especially in the case of TF-IDF transformations.  The IDF 
transform mainly, but also normalization provide alot of information on the 
holdout set to the training set if calculated previous to splitting them up.  

I'm not sure if given the current seq2sparse implementation's status as Legacy 
and the relatively minor advantages that it might give weather or not its worth 
adding something like a "split" option to SparseVectorsFromSequenceFiles.java.  
But i know that i saw a new implementation being discussed and and think that 
it would be worth it to have an option like this built in.    

I think that this issue may have been raised before, but i wanted to bring it 
up again in light of the current move away from MapReduce and the new 
implementations of Mahout tools that will be coming along. 




> Creating holdout sets with seq2sparse and split
> -----------------------------------------------
>
>                 Key: MAHOUT-1545
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1545
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification, CLI, Examples
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>
> The current method for vectorizing data using seq2sparse and then "split" 
> allows a large amount of information to spill over from the training sets 
> to the test sets, especially in the case of TF-IDF transformations.  The IDF 
> transform gives the training set a lot of information about the holdout set 
> if it is calculated before the data is split.  
> Given the current seq2sparse implementation's status as Legacy and the 
> relatively minor advantage that it might give, I'm not sure whether it's 
> worth adding something like a "split" option to 
> SparseVectorsFromSequenceFiles.java.  But I know that I saw a new 
> implementation being discussed, and I think that it would be worth having 
> an option like this built in.    
> I think that this issue may have been raised before, but I wanted to bring it 
> up again in light of the current move away from MapReduce and the new 
> implementations of Mahout tools that will be coming along. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
