[
https://issues.apache.org/jira/browse/MAHOUT-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Potter updated MAHOUT-1019:
-----------------------------------
Attachment: MAHOUT-1019.patch
OK, here is a solution to this problem. It made a huge difference in the size
of the output in our environment: with this patch, the output shrank to a few
GB, as opposed to 4+ TB!
I only applied this to the "pw" style output, as it probably isn't useful for
the "v" style output. Also, the default value is Double.MAX_VALUE rather than
the 1.0 proposed in the description, since a default of 1.0 would imply you
were using cosine distance.
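For readers without the patch at hand, the filter described above can be
sketched roughly as follows. This is an illustration only, not the code in
MAHOUT-1019.patch: the class name, the method names, the "seedIndex,distance"
row format, and the use of Euclidean distance are all assumptions, and the
comparison is assumed to be inclusive (a distance equal to the threshold is
still written).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a max-distance filter; not the actual Mahout patch. */
public class DistanceThresholdSketch {

  /** Default that disables filtering: no finite distance exceeds it. */
  static final double NO_THRESHOLD = Double.MAX_VALUE;

  /** Euclidean distance, used here only as a stand-in distance measure. */
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Builds "seedIndex,distance" rows for one input vector, skipping any seed
   * whose distance is greater than maxDistance (assumed inclusive bound).
   */
  static List<String> pairwiseRows(double[][] seeds, double[] vector, double maxDistance) {
    List<String> rows = new ArrayList<>();
    for (int s = 0; s < seeds.length; s++) {
      double distance = euclidean(seeds[s], vector);
      if (distance <= maxDistance) {
        rows.add(s + "," + distance);
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    double[][] seeds = {{0.0, 0.0}, {3.0, 4.0}, {100.0, 100.0}};
    double[] vector = {0.0, 0.0};
    // With the default, every pair is kept.
    System.out.println(pairwiseRows(seeds, vector, NO_THRESHOLD).size()); // prints 3
    // With a threshold of 10.0, the far-away seed is dropped.
    System.out.println(pairwiseRows(seeds, vector, 10.0).size()); // prints 2
  }
}
```

Using Double.MAX_VALUE as the default means the comparison always passes for
finite distances, so existing jobs behave as before unless a threshold is
supplied.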
> VectorDistanceSimilarityJob
> ---------------------------
>
> Key: MAHOUT-1019
> URL: https://issues.apache.org/jira/browse/MAHOUT-1019
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Environment: all
> Reporter: Timothy Potter
> Priority: Minor
> Labels: VectorDistanceSimilarityJob, distance, vector
> Attachments: MAHOUT-1019.patch
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> The VectorDistanceSimilarityJob is a fantastic tool, but poses the risk of
> creating terabytes of output of dubious value. For example, I have ~10K seed
> vectors and millions of vectors to compute the similarity against. I would
> like to add an optional parameter to this job that specifies a maximum
> distance threshold and prevents any distance above this value from being
> written to the output. The default would be 1.0d, so no filtering is applied,
> which ensures backwards compatibility; if supplied, only rows where the
> distance is not greater than the threshold would be output from the mapper.
> This can greatly reduce the storage requirements of the output. A possible
> name for the parameter: noOutputIfDistanceGreaterThan