[ 
https://issues.apache.org/jira/browse/MAHOUT-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845495#action_12845495
 ] 

Jake Mannix commented on MAHOUT-337:
------------------------------------

[quote]
Yes it's possible to fix this by forcing vectors to compute their 
length-squared always before serialization. 
[quote]

Why would we do this?  If it's been computed, then we serialize and store the 
additional 8 bytes and transmit them with the vector.  If it has not been 
computed, this number gets transmitted as -1.  
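
To make the scheme above concrete, here is a minimal sketch in Java.  It is 
not Mahout's actual Vector code: the class CachedLengthVector, its dense 
double[] storage, and the binary layout (length, values, then the cached 
value) are all assumptions made for illustration.  It only shows the idea of 
writing the cached length-squared as 8 extra bytes, with -1 standing in for 
"not computed yet".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical dense vector for illustration; not Mahout's Vector class. */
class CachedLengthVector {
  private final double[] values;
  // -1 means "not computed yet"; a real squared length is never negative.
  private double lengthSquared = -1.0;

  CachedLengthVector(double[] values) {
    this.values = values.clone();
  }

  /** Dot product; assumes both vectors have the same number of elements. */
  double dot(CachedLengthVector other) {
    double sum = 0.0;
    for (int i = 0; i < values.length; i++) {
      sum += values[i] * other.values[i];
    }
    return sum;
  }

  /** Computes the squared length once, then returns the cached value. */
  double getLengthSquared() {
    if (lengthSquared < 0.0) {
      lengthSquared = dot(this);
    }
    return lengthSquared;
  }

  boolean isLengthSquaredCached() {
    return lengthSquared >= 0.0;
  }

  /** Writes the element count, the elements, then 8 bytes of cache (or -1). */
  byte[] serialize() {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(values.length);
      for (double v : values) {
        out.writeDouble(v);
      }
      out.writeDouble(lengthSquared);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads the layout written by serialize(), restoring the cache as-is. */
  static CachedLengthVector deserialize(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      double[] values = new double[in.readInt()];
      for (int i = 0; i < values.length; i++) {
        values[i] = in.readDouble();
      }
      CachedLengthVector v = new CachedLengthVector(values);
      v.lengthSquared = in.readDouble();
      return v;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, a vector serialized before getLengthSquared() was ever called 
comes back with the -1 marker and computes the value on first use, while a 
vector serialized afterwards comes back with the value already in place.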

The cost of computing lengthSquared is just a dot product (of the vector with 
itself).  In many algorithms, during an M/R pass, you may do the following: 
deserialize the vector, compute its dot product with some other vector (the 
canopy centroid, the eigenvector, etc.), then possibly normalize this value by 
dividing by Math.sqrt(getLengthSquared()), and then accumulate the result 
somewhere.  This happens for each of N vectors, each maybe K times (N is 
millions to hundreds of millions or more, and K is maybe tens to hundreds).

If the vectors have D nonzero elements, the computational cost is 
O(N*K*(dot(D) + deserialize(D+1))) for the cached case, vs. 
O(N*K*(dot(D) + dot(D) + deserialize(D))) for the uncached case, where dot(D) 
is the cost of computing a dot product with a vector with D nonzero elements, 
and deserialize(x) is the cost of deserializing and transporting around a 
vector with x nonzero elements.

I'm pretty sure that computing two D-sparse dot products and moving around a 
D-sparse vector is way worse than computing one D-sparse dot product and moving 
around a (D+1)-sparse vector.

It defeats the entire purpose of caching to have to recompute it on every 
single M/R pass.

> Don't serialize cached length squared in JSON vector representation
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-337
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-337
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>
> The cached length-squared field in vectors should be marked transient so that 
> it is not part of the JSON serialized state. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
