If you run LSH over the SVD-projections of your vectors, on the other hand, you have not only distributed them in all orthants, but made the distribution mostly uniform on the large scale, on account of the singular value rescaling.
On Apr 25, 2011 3:12 PM, "Ted Dunning" <[email protected]> wrote: Btw... LSH came up recently (thanks Lance!). One wrinkle that should be mentioned that might catch somebody implementing this unawares is that documents in a vector space model have highly non-random distributions that make the default formulation of LSH very bad. The problem is that document vectors are normally confined to the positive orthant. That means that a random hyper-plane has a very low chance of splitting any to documents and thus picking random vectors as normals is a really bad way to get hash functions. This problem can be solved easily enough by picking separating planes by picking two points at random without replacement and using their difference as the normal vector for the separating plane. This can be shown to give a hashing funcction that has the requisite 50% probability of being positive for any document.
