Timothy Potter created MAHOUT-1019:
--------------------------------------

             Summary: VectorDistanceSimilarityJob
                 Key: MAHOUT-1019
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1019
             Project: Mahout
          Issue Type: Improvement
          Components: Math
         Environment: all
            Reporter: Timothy Potter
            Priority: Minor


The VectorDistanceSimilarityJob is a fantastic tool, but it risks creating 
terabytes of output of dubious value. For example, I have ~10K seed vectors 
and millions of vectors to compare against them. I would like to add an 
optional parameter to this job that specifies a maximum distance threshold, 
so that any distance above this value is not written to the output. The 
default would be 1.0d so that no filtering is applied, which ensures 
backwards compatibility. If the parameter is supplied, the mapper would only 
output rows whose distance does not exceed the threshold. This can reduce the 
storage requirements of the output immensely. The parameter could be named 
something like: noOutputIfDistanceGreaterThan
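
To illustrate the proposal, here is a minimal sketch of the filtering step in 
plain Java, outside of Hadoop and Mahout. It is not the job's actual code: the 
class ThresholdFilterSketch, the Row type, and the filter method are 
hypothetical names, Euclidean distance stands in for whichever distance 
measure the job is configured with, and treating a distance equal to the 
threshold as "keep" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class ThresholdFilterSketch {

    /** One output row: seed index, vector index, and the distance between them. */
    public static final class Row {
        public final int seed;
        public final int vector;
        public final double distance;

        public Row(int seed, int vector, double distance) {
            this.seed = seed;
            this.vector = vector;
            this.distance = distance;
        }
    }

    /** Euclidean distance; an assumed stand-in for the configured measure. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Mimics the map step: compares every vector with every seed and keeps
     * only the rows whose distance is at most maxDistance (hypothetical
     * equivalent of the proposed noOutputIfDistanceGreaterThan parameter).
     */
    public static List<Row> filter(double[][] seeds, double[][] vectors, double maxDistance) {
        List<Row> out = new ArrayList<>();
        for (int v = 0; v < vectors.length; v++) {
            for (int s = 0; s < seeds.length; s++) {
                double d = euclidean(seeds[s], vectors[v]);
                if (d <= maxDistance) {
                    out.add(new Row(s, v, d));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] seeds = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] vectors = {{0.3, 0.4}, {9.0, 10.0}, {5.0, 5.0}};
        // Prints one tab-separated line per surviving (seed, vector) pair.
        for (Row r : filter(seeds, vectors, 1.0)) {
            System.out.println(r.seed + "\t" + r.vector + "\t" + r.distance);
        }
    }
}
```

With the sample data in main, only two of the six seed/vector pairs fall 
within the threshold of 1.0, so the other four rows are never written. In the 
real job the same comparison would sit in the mapper just before the write.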


        
