[ 
https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033469#comment-13033469
 ] 

Jake Mannix commented on MAHOUT-695:
------------------------------------

Hmm... the dataset is always there, and for algorithms running on vectors it 
will always consist of at least one vector, naturally.  So I'm guessing that 
we should always be able to do this, and in the grand scheme of things it won't 
be much slower than just supplying the param. So yeah, I agree, let's swap it 
out so that we just sniff the data and use the info we get from that!
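
A rough sketch of the sniffing idea, not the actual patch: take the first 
vector from the input and use its cardinality as the number of words. 
SizedVector and sniffNumWords are hypothetical stand-ins made up for this 
illustration; in Mahout the cardinality would presumably come from 
Vector.size() on a vector read from the input sequence files.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class VocabSniffer {

    /** Hypothetical stand-in for a term-frequency vector; only its cardinality matters here. */
    public interface SizedVector {
        int size();
    }

    /**
     * Returns the cardinality of the first vector in the input.
     * Assumes every vector in the dataset shares the same cardinality,
     * so looking at one vector is enough.
     */
    public static int sniffNumWords(Iterable<? extends SizedVector> vectors) {
        Iterator<? extends SizedVector> it = vectors.iterator();
        if (!it.hasNext()) {
            // The comment above assumes at least one vector; fail loudly if not.
            throw new NoSuchElementException("input contains no vectors");
        }
        return it.next().size();
    }

    public static void main(String[] args) {
        List<SizedVector> input = new ArrayList<>();
        input.add(() -> 50000); // illustrative cardinality only
        input.add(() -> 50000);
        System.out.println(sniffNumWords(input));
    }
}
```

The cost is one extra read of a single record before the job starts, which is 
why it should not be much slower than passing the value by hand.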

> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-695
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-695
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Priority: Minor
>         Attachments: mahout-695.patch, mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the 
> LDADriver, 
> e.g. ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20 
> With this patch you can instead provide a dictionary; we just count the terms 
> in the dictionary, 
> e.g. ./bin/mahout lda \
>      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
>      -o ./examples/bin/work/reuters-lda \
>      -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
>      -k 20 -ow -x 20 
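
For illustration only, the "count the terms in the dictionary" step might look 
something like the sketch below. It is not the attached patch: 
DictionaryCounter and countTerms are hypothetical names, and an in-memory map 
of term to index stands in for the dictionary file, which in a real run would 
be streamed from disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryCounter {

    /**
     * Counts dictionary entries by streaming over them once.
     * Each entry is assumed to be a (term, index) pair.
     */
    public static int countTerms(Iterable<Map.Entry<String, Integer>> entries) {
        int numTerms = 0;
        for (Map.Entry<String, Integer> ignored : entries) {
            numTerms++;
        }
        return numTerms;
    }

    public static void main(String[] args) {
        // Tiny made-up dictionary; real ones hold tens of thousands of terms.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        dictionary.put("oil", 0);
        dictionary.put("grain", 1);
        dictionary.put("trade", 2);
        System.out.println(countTerms(dictionary.entrySet()));
    }
}
```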

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
