[ https://issues.apache.org/jira/browse/MAHOUT-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845495#action_12845495 ]
Jake Mannix commented on MAHOUT-337:
------------------------------------

[quote] Yes it's possible to fix this by forcing vectors to compute their length-squared always before serialization. [quote]

Why would we do this? If it's been computed, then we serialize and store the additional 8 bytes and transmit them with the vector. If it has not been computed, this number gets transmitted as -1.

The cost of computing lengthSquared is just a dot product (of the vector with itself). In many algorithms, during an M/R pass, you may do the following: deserialize the vector, compute its dot product with some other vector (the canopy centroid, the eigenvector, etc.), then possibly normalize this value by dividing by Math.sqrt(getLengthSquared()), and then accumulate the result somewhere. This happens for each of N vectors, on each of maybe K passes (N is millions to hundreds of millions or more, and K is maybe tens to hundreds).

If the vectors have D nonzero elements, the computational cost is

    O(N*K*(dot(D) + deserialize(D+1)))          for the cached case, vs.
    O(N*K*(dot(D) + dot(D) + deserialize(D)))   for the uncached case,

where dot(D) is the cost of computing a dot product with a vector with D nonzero elements, and deserialize(x) is the cost of deserializing and transporting around a vector with x nonzero elements.

I'm pretty sure that computing two D-sparse dot products and moving around a D-sparse vector is way worse than computing one D-sparse dot product and moving around a (D+1)-sparse vector. It defeats the entire purpose of caching to have to recompute it on every single M/R pass.
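To make the trade-off concrete, here is a minimal sketch of the scheme described above: the cached value travels with the vector as one extra 8-byte double, with -1 meaning "not computed". The class CachedNormVector, its binary layout, and the selfDotCount counter are hypothetical and exist only for illustration; this is not Mahout's actual Vector code or its JSON format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sparse vector that serializes its cached squared length. */
final class CachedNormVector {
    private final int[] indices;
    private final double[] values;
    // -1 marks "not computed"; a real squared length is never negative.
    private double lengthSquared = -1.0;
    // Counts how often the self dot product was actually computed (illustration only).
    int selfDotCount = 0;

    CachedNormVector(int[] indices, double[] values) {
        this.indices = indices.clone();
        this.values = values.clone();
    }

    /** Returns the cached value, computing the self dot product only when needed. */
    double getLengthSquared() {
        if (lengthSquared < 0) {
            double sum = 0.0;
            for (double v : values) {
                sum += v * v;
            }
            lengthSquared = sum;
            selfDotCount++;
        }
        return lengthSquared;
    }

    /** Writes D (index, value) pairs plus one extra double for the cache. */
    byte[] serialize() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(values.length);
            for (int i = 0; i < values.length; i++) {
                out.writeInt(indices[i]);
                out.writeDouble(values[i]);
            }
            out.writeDouble(lengthSquared); // -1 if never computed
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores the vector together with whatever cache state was written. */
    static CachedNormVector deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int n = in.readInt();
            int[] idx = new int[n];
            double[] vals = new double[n];
            for (int i = 0; i < n; i++) {
                idx[i] = in.readInt();
                vals[i] = in.readDouble();
            }
            CachedNormVector v = new CachedNormVector(idx, vals);
            v.lengthSquared = in.readDouble();
            return v;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, a vector whose length was computed before serialization costs no extra dot product after deserialization, while one serialized with -1 pays for the self dot product again on each pass that calls getLengthSquared().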
> Don't serialize cached length squared in JSON vector representation
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-337
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-337
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>
> The cached length-squared field in vectors should be marked transient so that
> it is not part of the JSON serialized state.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.