[ 
https://issues.apache.org/jira/browse/MAHOUT-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832263#action_12832263
 ] 

Robin Anil commented on MAHOUT-285:
-----------------------------------

Success. I just finished the integration of DictionaryVectorizer with ngram. 

Some stats for the Reuters collection:

52 sec for unigram-only vectorization              Total size: 19MB
3 min 42 sec using bigrams (via colloc)            Total size: 26MB
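
To illustrate why the bigram run produces a larger output than the unigram run, here is a minimal sketch of dictionary-based vectorization with and without bigrams. It is not the Mahout code: the class name NgramDictionarySketch and its methods are hypothetical, it runs in memory rather than as a MapReduce job, and it keeps every adjacent bigram instead of selecting collocations the way colloc does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Mahout.
public class NgramDictionarySketch {

    // Return the unigrams of a document, plus space-joined adjacent
    // bigrams when maxNgram >= 2 (no collocation filtering is applied).
    static List<String> terms(List<String> tokens, int maxNgram) {
        List<String> out = new ArrayList<>(tokens);
        if (maxNgram >= 2) {
            for (int i = 0; i + 1 < tokens.size(); i++) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Assign each distinct term an integer id in first-seen order.
    static Map<String, Integer> buildDictionary(List<List<String>> docs, int maxNgram) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : terms(doc, maxNgram)) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    // Build a sparse term-frequency vector: dictionary id -> count.
    // Terms missing from the dictionary are skipped.
    static Map<Integer, Integer> vectorize(List<String> doc, Map<String, Integer> dict, int maxNgram) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String term : terms(doc, maxNgram)) {
            Integer id = dict.get(term);
            if (id != null) {
                vector.merge(id, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("crude", "oil", "prices", "rose"),
            List.of("oil", "prices", "fell"));

        Map<String, Integer> unigramDict = buildDictionary(docs, 1);
        Map<String, Integer> bigramDict = buildDictionary(docs, 2);

        System.out.println("unigram dictionary size: " + unigramDict.size());
        System.out.println("bigram dictionary size:  " + bigramDict.size());
        System.out.println("doc 2 vector (bigrams):  " + vectorize(docs.get(1), bigramDict, 2));
    }
}
```

In this toy input the dictionary grows from 5 entries to 9 once bigrams are included, so each document vector carries more non-zero entries.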


Keeping it open for Drew to help tidy up the code. It's over and out from 
India (4 am).



> Wrap up collocation and dictionary vectorizer integration
> ---------------------------------------------------------
>
>                 Key: MAHOUT-285
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-285
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>             Fix For: 0.3
>
>         Attachments: MAHOUT-285.patch, MAHOUT-285.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Final bit of work to integrate collocations into 0.3
> * Modify collocation finder to use dictionary vectorizer output as input 
> (saves analysis step)
> * Generate input dictionary for dictionary vectorizer that includes unigrams 
> and collocations.
> Chatted with Robin this morning. I know what needs to be done; it is just a 
> matter of grinding out the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
