Timothy Potter created MAHOUT-1019:
--------------------------------------
Summary: VectorDistanceSimilarityJob
Key: MAHOUT-1019
URL: https://issues.apache.org/jira/browse/MAHOUT-1019
Project: Mahout
Issue Type: Improvement
Components: Math
Environment: all
Reporter: Timothy Potter
Priority: Minor
The VectorDistanceSimilarityJob is a fantastic tool, but it risks creating
terabytes of output of dubious value. For example, I have ~10K seed vectors
and millions of vectors to compare against them. I would like to add an
optional parameter to this job that specifies a maximum distance threshold,
so that any distance above this value is not written to the output. The
default would be 1.0d so no filtering is applied, which ensures backwards
compatibility; if a threshold is supplied, only rows where the distance does
not exceed it would be output from the mapper. This can greatly reduce the
storage requirements of the output. The parameter could be named something
like: noOutputIfDistanceGreaterThan
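
The filtering described above could look roughly like the following. This is
only a minimal standalone sketch, not the job's actual mapper code: the class
ThresholdFilterSketch, its methods, and the plain Euclidean distance are all
hypothetical stand-ins chosen for illustration, and it assumes a pair is kept
when its distance is less than or equal to the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the proposed threshold; not Mahout code.
final class ThresholdFilterSketch {

    // Assumption: a pair is written only if its distance does not exceed the threshold.
    static boolean shouldEmit(double distance, double maxDistance) {
        return distance <= maxDistance;
    }

    // Plain Euclidean distance, standing in for whatever measure the job is configured with.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Compares every vector against every seed and keeps only the pairs within the threshold.
    static List<String> filter(double[][] seeds, double[][] vectors, double maxDistance) {
        List<String> output = new ArrayList<>();
        for (int s = 0; s < seeds.length; s++) {
            for (int v = 0; v < vectors.length; v++) {
                double distance = euclidean(seeds[s], vectors[v]);
                if (shouldEmit(distance, maxDistance)) {
                    output.add("seed" + s + ",vector" + v + "," + distance);
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        double[][] seeds = {{0.0, 0.0}};
        double[][] vectors = {{3.0, 4.0}, {6.0, 8.0}};
        // With a threshold of 7.0 only the nearer vector is kept.
        for (String row : filter(seeds, vectors, 7.0)) {
            System.out.println(row);
        }
    }
}
```

In the real job the check would sit in the mapper just before the write, so
pairs beyond the threshold never reach the output at all.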