[ https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276869#comment-13276869 ]
Andy Schlaikjer commented on MAHOUT-962:
----------------------------------------
Hi John, Grant,
I ran into this issue last summer while working with Jake Mannix on CVB0 LDA. I
ended up writing a Pig script to produce weighted term vectors, using Elephant
Bird's SequenceFileStorage and VectorWritableConverter utilities:
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/mahout/VectorWritableConverter.java
Now that the above are open sourced, I'd like to get a generic Mahout-Pig
submodule rolling, and perhaps include a version of my term vector script
there. The script ended up being relatively concise, with more flexible term
filtering and weighting mechanisms. Due to Pig's execution plan optimization,
it also ran faster than comparable Mahout utils on my data.
Best,
Andy
@sagemintblue
> minDF and maxDFPercent filtering doesn't get applied when output weight is tf
> in SparseVectorsFromSequenceFiles
> -----------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-962
> URL: https://issues.apache.org/jira/browse/MAHOUT-962
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Reporter: John Conwell
> Priority: Minor
> Fix For: 0.8
>
>
> This follows the same reasoning as the fix for MAHOUT-957. The desired output
> is term frequency (tf) vectors, but I want terms filtered by their min and max
> DF values. This can be valid for LDA, where tf vectors are the desired input
> but filtering out terms above maxDFPercent is still useful. Currently minDF
> and maxDFPercent are only applied when calculating tfidf, and the original tf
> vectors are not updated to reflect the term filtering.
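The behavior requested above can be illustrated with a minimal sketch in plain Python. This is not Mahout code: the function name `tf_vectors_with_df_filter`, its parameters, and the inclusive bounds on document frequency are all assumptions made for illustration, and Mahout's own boundary handling may differ.

```python
from collections import Counter


def tf_vectors_with_df_filter(docs, min_df=1, max_df_percent=100):
    """Return one {term: count} dict per document, after DF-based pruning.

    docs is a list of token lists. The names and the inclusive bounds
    are hypothetical and do not mirror Mahout's implementation.
    """
    num_docs = len(docs)
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # Upper bound expressed as a percentage of the corpus size.
    max_df = max_df_percent / 100.0 * num_docs
    # Keep only terms whose DF falls inside [min_df, max_df].
    kept = {term for term, n in df.items() if min_df <= n <= max_df}
    # Emit raw term counts (tf), restricted to the kept vocabulary.
    return [dict(Counter(t for t in doc if t in kept)) for doc in docs]
```

The point of the sketch is that the document-frequency filter is applied to the vocabulary before the tf vectors are emitted, so the pruning does not depend on choosing tfidf as the output weight.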