[ 
https://issues.apache.org/jira/browse/MAHOUT-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832047#action_12832047
 ] 

Drew Farris commented on MAHOUT-285:
------------------------------------

Yes, I'm very close on this and should be able to wrap it up within the next 
day or two.

bq. In the Colloc driver, why not run DocumentProcessor as the first step 
instead of using the SparseVectorsFromSequenceFiles task?

I will add an option to either run the DocumentProcessor as the first step or 
take input from a DocumentProcessor run done separately. I don't want to force 
the user to re-tokenize each time if it is not necessary.

As far as the unigrams are concerned, I think passing them through the M/R 
tasks makes more sense than treating them as side data. That increases the 
size of the output, and I don't see an elegant way around it, but I'm willing 
to entertain suggestions. Generating unigrams will be optional as well.
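
To make the output-size trade-off concrete, here is a rough sketch in plain 
Java. It is not the code in the patch and it does not use the Hadoop API; the 
class name NgramEmitSketch, the method emit, and the emitUnigrams flag are all 
hypothetical names chosen for illustration. It only shows the idea of emitting 
unigrams in the same pass as the bigrams, behind an option.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code in MAHOUT-285.patch.
final class NgramEmitSketch {

    // Returns the records a single pass over one tokenized document would emit.
    // Bigrams are always emitted; unigrams are emitted only when emitUnigrams
    // is true, which is what makes the output larger.
    static List<String> emit(List<String> tokens, boolean emitUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (emitUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }
}
```

For a three-token document this emits two records without unigrams and five 
with them, which is the kind of growth in output size mentioned above.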

> Wrap up collocation and dictionary vectorizer integration
> ---------------------------------------------------------
>
>                 Key: MAHOUT-285
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-285
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>             Fix For: 0.3
>
>         Attachments: MAHOUT-285.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Final bit of work to integrate collocations into 0.3
> * Modify collocation finder to use dictionary vectorizer output as input 
> (saves analysis step)
> * Generate input dictionary for dictionary vectorizer that includes unigrams 
> and collocations.
> Chatted with Robin this morning. I know what needs to be done; it is just a 
> matter of grinding out the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
