Btw... LSH came up recently (thanks Lance!).

One wrinkle that should be mentioned that might catch somebody implementing
this unawares is
that documents in a vector space model have highly non-random distributions
that make the default
formulation of LSH very bad.

The problem is that document vectors are normally confined to the positive
orthant.  That means that
a random hyper-plane has a very low chance of splitting any to documents and
thus picking random
vectors as normals is a really bad way to get hash functions.

This problem can be solved easily enough by picking separating planes by
picking two points at random
without replacement and using their difference as the normal vector for the
separating plane.  This
can be shown to give a hashing funcction that has the requisite 50%
probability of being positive for
any document.

Reply via email to