Re: LSH notes on text documents

Jake Mannix Mon, 25 Apr 2011 15:46:00 -0700

If you run LSH over the SVD-projections of your vectors, on the other hand,
you have not only distributed them in all orthants, but made the
distribution mostly uniform on the large scale, on account of the singular
value rescaling.


On Apr 25, 2011 3:12 PM, "Ted Dunning" <[email protected]> wrote:

Btw... LSH came up recently (thanks Lance!).

One wrinkle that should be mentioned that might catch somebody implementing
this unawares is
that documents in a vector space model have highly non-random distributions
that make the default
formulation of LSH very bad.

The problem is that document vectors are normally confined to the positive
orthant.  That means that
a random hyper-plane has a very low chance of splitting any to documents and
thus picking random
vectors as normals is a really bad way to get hash functions.

This problem can be solved easily enough by picking separating planes by
picking two points at random
without replacement and using their difference as the normal vector for the
separating plane.  This
can be shown to give a hashing funcction that has the requisite 50%
probability of being positive for
any document.

Re: LSH notes on text documents

Reply via email to