[ https://issues.apache.org/jira/browse/MAHOUT-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033651#comment-13033651 ]
Mat Kelcey commented on MAHOUT-695:
-----------------------------------
Here's another patch that determines the number of words from the first vector.
I've left the numwords option in as a form of deprecation, so that a warning
can be given. The alternative, taking the option out, would fail at startup
with a complaint about the unknown argument. So depending on how much backwards
compatibility you're after, this might not be needed...
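
To make the idea concrete, here is a rough sketch of the two behaviours described
above: reading the vocabulary size off the first input vector, and still accepting
the old option but only warning about it. It is written in Python rather than
Mahout's Java and is not taken from the attached patch. SparseVector,
resolve_num_words and num_words_option are hypothetical names, and ignoring the
supplied value after warning is an assumption.

```python
import warnings
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class SparseVector:
    """Hypothetical stand-in for a term-frequency vector."""
    cardinality: int  # declared dimension, i.e. the vocabulary size
    entries: Dict[int, float] = field(default_factory=dict)  # term index -> weight


def resolve_num_words(vectors: Iterable[SparseVector],
                      num_words_option: Optional[int] = None) -> int:
    """Infer the number of words from the first vector's cardinality."""
    # The deprecated option is still accepted so that old command lines keep
    # starting up. Assumption: it only triggers a warning and is then ignored.
    if num_words_option is not None:
        warnings.warn(
            "the number-of-words option is deprecated; "
            "the value is now read from the first input vector",
            DeprecationWarning,
            stacklevel=2,
        )

    first = next(iter(vectors), None)
    if first is None:
        raise ValueError("no input vectors to infer the number of words from")
    return first.cardinality
```

Dropping the option outright would be simpler, but existing invocations that
still pass it would then stop at argument parsing, which is the trade-off
mentioned above.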
> Option to determine number of words for LDADriver from a specified dictionary
> -----------------------------------------------------------------------------
>
> Key: MAHOUT-695
> URL: https://issues.apache.org/jira/browse/MAHOUT-695
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.5
> Reporter: Mat Kelcey
> Priority: Minor
> Attachments: mahout-695-sniff-vector.patch, mahout-695.patch,
> mahout-695.patch
>
>
> It bugged me that you needed to specify the number of words directly to the
> LDADriver, e.g.:
> ./bin/mahout lda \
> -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
> -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20
> With this patch you can instead provide a dictionary; we just count the terms
> in the dictionary, e.g.:
> ./bin/mahout lda \
> -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
> -o ./examples/bin/work/reuters-lda \
> -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
> -k 20 -ow -x 20