[ 
https://issues.apache.org/jira/browse/MAHOUT-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated MAHOUT-1019:
-----------------------------------

    Attachment: MAHOUT-1019.patch

Ok, so here's a patch that addresses this. It made a huge difference to the 
size of the output in our environment: with the patch applied, the output 
shrank to a few GB, as opposed to 4+ TB!

I only applied the threshold to the "pw" style output, as it probably isn't 
useful for the "v" style output. Also, the default value is Double.MAX_VALUE 
rather than 1.0, since a default of 1.0 would imply you were using cosine 
distance.
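
To illustrate the filtering idea, here is a minimal, self-contained sketch. 
It is not the code in the attached patch: the class ThresholdFilterSketch, 
its method names, the Euclidean measure, and the choice of <= for the 
comparison are all assumptions made for illustration, and the real job does 
this work inside a Hadoop mapper rather than in nested loops.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of threshold filtering; NOT the code in MAHOUT-1019.patch.
class ThresholdFilterSketch {

  // Default that disables filtering: every finite distance is <= Double.MAX_VALUE.
  static final double NO_FILTER = Double.MAX_VALUE;

  // Plain Euclidean distance, used only to keep the sketch self-contained.
  // Assumes both arrays have the same length.
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Returns "seedIndex,vectorIndex,distance" rows for every seed/vector pair,
  // skipping pairs whose distance is greater than maxDistance.
  static List<String> pairwise(double[][] seeds, double[][] vectors, double maxDistance) {
    List<String> out = new ArrayList<>();
    for (int s = 0; s < seeds.length; s++) {
      for (int v = 0; v < vectors.length; v++) {
        double distance = euclidean(seeds[s], vectors[v]);
        if (distance <= maxDistance) { // assumed comparison; drop anything above the threshold
          out.add(s + "," + v + "," + distance);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    double[][] seeds = {{0.0, 0.0}};
    double[][] vectors = {{3.0, 4.0}, {30.0, 40.0}};
    System.out.println(pairwise(seeds, vectors, NO_FILTER).size()); // prints 2
    System.out.println(pairwise(seeds, vectors, 10.0));             // prints [0,0,5.0]
  }
}
```

Note that both distances in the example are above 1.0, which is why a 
default of 1.0 would silently drop output for measures that are not bounded 
the way cosine distance is.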
                
> VectorDistanceSimilarityJob
> ---------------------------
>
>                 Key: MAHOUT-1019
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1019
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>         Environment: all
>            Reporter: Timothy Potter
>            Priority: Minor
>              Labels: VectorDistanceSimilarityJob, distance, vector
>         Attachments: MAHOUT-1019.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> The VectorDistanceSimilarityJob is a fantastic tool, but it risks creating 
> terabytes of output of dubious value. For example, I have ~10K seed vectors 
> and millions of vectors to compare them against. I would like to add an 
> optional parameter to this job that specifies a maximum distance threshold, 
> so that any distance above this value is not written to the output. The 
> default would be 1.0d so no filtering is applied, which ensures backwards 
> compatibility; if supplied, only rows where the distance is less than the 
> threshold would be output from the mapper. This can reduce the storage 
> requirements of the output immensely. A possible name for the parameter: 
> noOutputIfDistanceGreaterThan

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
