[ 
https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898505#action_12898505
 ] 

Ted Dunning commented on MAHOUT-479:
------------------------------------

Regarding vectorization strategies in general, I know of three that we 
currently use:

a) full-size vector model.  Each continuous variable and each unique term for 
each text-like variable and each unique interaction term is assigned a unique 
location in the feature vector.  This requires that we know the vocabulary and 
the number of interaction combinations that exist.  This is the strategy used 
by the Lucene to vector converter.

b) pass around the text.  This is what Naive Bayes currently does.  The result 
is kind of like a duplicated vector implementation with strings as the indices. 
The current implementation has difficulty with different fields containing text 
and with continuous variables.  This avoids a dictionary building pass, but 
leads to other difficulties unless we figure out a way to inject a text 
vectorizer.

c) feature hashing.  This is what SGD supports.  Here, we pick the size of the 
vectors a priori.  Words are assigned a location based on a hash of the 
variable name and the word value.  Continuous variables are assigned locations 
based on the variable name.  It can be moderately difficult to reverse engineer 
a vector back to features, since hash collisions can make the mapping ambiguous 
with very large feature spaces, but it isn't necessary to build a dictionary in 
order to make vectors.
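
To make the contrast between (a) and (c) concrete, here is a minimal Python 
sketch.  It is not Mahout code: the function names, the whitespace tokenizer, 
the "field:term" hash key and the use of MD5 are all assumptions made for 
illustration, and interaction terms are left out.  The point is only that (a) 
needs a dictionary pass over the data before any vector can be built, while (c) 
needs nothing but a vector size chosen up front.

```python
import hashlib


def tokens(text):
    # Hypothetical tokenizer: lower-case and split on whitespace.
    return text.lower().split()


# --- (a) full-size vector model: requires a dictionary-building pass ---

def build_dictionary(records):
    """Assign every (field, term) pair and every continuous field its own slot."""
    index = {}
    for rec in records:
        for field, value in rec.items():
            if isinstance(value, str):
                for term in tokens(value):
                    index.setdefault((field, term), len(index))
            else:
                # Continuous variable: one slot per variable name.
                index.setdefault((field, None), len(index))
    return index


def full_size_vector(rec, index):
    """Encode one record using a previously built dictionary."""
    vec = [0.0] * len(index)
    for field, value in rec.items():
        if isinstance(value, str):
            for term in tokens(value):
                slot = index.get((field, term))
                if slot is not None:  # terms missing from the dictionary are dropped
                    vec[slot] += 1.0
        else:
            slot = index.get((field, None))
            if slot is not None:
                vec[slot] = float(value)
    return vec


# --- (c) feature hashing: vector size is picked a priori, no dictionary ---

def hashed_slot(key, size):
    # MD5 is an arbitrary choice here; it just gives a hash that is stable
    # across runs, unlike Python's built-in hash() for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def hashed_vector(rec, size):
    """Encode one record without any dictionary."""
    vec = [0.0] * size
    for field, value in rec.items():
        if isinstance(value, str):
            # Words: location from a hash of variable name plus word value.
            for term in tokens(value):
                vec[hashed_slot(field + ":" + term, size)] += 1.0
        else:
            # Continuous variables: location from the variable name alone.
            vec[hashed_slot(field, size)] += float(value)
    return vec


records = [
    {"title": "big red dog", "price": 3.5},
    {"title": "red cat", "price": 1.0},
]
index = build_dictionary(records)            # extra pass over all the data
dense = full_size_vector(records[1], index)  # length == len(index)
hashed = hashed_vector(records[1], 16)       # length fixed at 16
```

With the dictionary, slot i can be mapped straight back to its feature.  With 
hashing, two features may land in the same slot, which is the ambiguity 
mentioned under (c).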


> Streamline classification/ clustering data structures
> -----------------------------------------------------
>
>                 Key: MAHOUT-479
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-479
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.1, 0.2, 0.3, 0.4
>            Reporter: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our 
> classification and clustering algorithms to make integration for users easier 
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was 
> that our classification (and clustering) stuff was all over the map in terms 
> of data structures.  Driving that to rest and getting those comments even 
> vaguely as plug and play as our much more advanced recommendation components 
> would be very, very helpful.
> {quote}
> This issue probably also relates to MAHOUT-287 (intention there is to make 
> naive bayes run on vectors as input).
> Ted, Jake, Robin: Would be great if one of you could add a comment on 
> some of the issues you discussed "the other evening" and (if applicable) any 
> minor or major changes you think could help solve this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
