Ted,
Seems like this is not a problem if you choose to map docs into an LSI-like
vector space: namely, instead of assigning each term its own dimension, assign
each term a sparse vector with entries chosen randomly from {0, 1, -1} (0 being
the most probable value). Problem solved, I think? A minimal sketch is below
the quoted message.

Randy

On Mon, Apr 25, 2011 at 3:11 PM, Ted Dunning <[email protected]> wrote:
> Btw... LSH came up recently (thanks Lance!).
>
> One wrinkle that should be mentioned that might catch somebody implementing
> this unawares is that documents in a vector space model have highly
> non-random distributions that make the default formulation of LSH very bad.
>
> The problem is that document vectors are normally confined to the positive
> orthant. That means that a random hyper-plane has a very low chance of
> splitting any two documents, and thus picking random vectors as normals is
> a really bad way to get hash functions.
>
> This problem can be solved easily enough by picking separating planes by
> choosing two points at random without replacement and using their difference
> as the normal vector for the separating plane. This can be shown to give a
> hashing function that has the requisite 50% probability of being positive
> for any document.
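
To make the suggestion concrete, here is a minimal Python sketch (not Mahout
code; the dimension k, the sparsity parameter s, and the function names are
all illustrative choices):

import hashlib
import numpy as np

def term_vector(term, k=200, s=3):
    # Sparse random vector in {-1, 0, +1}^k for this term.  Entries are
    # +1 or -1 each with probability 1/(2s) and 0 otherwise, so 0 is the
    # most probable value.  Seeding from a digest of the term keeps the
    # mapping stable across documents and processes.
    seed = int.from_bytes(hashlib.md5(term.encode()).digest()[:4], "little")
    u = np.random.default_rng(seed).random(k)
    return np.where(u < 1.0 / (2 * s), 1.0,
                    np.where(u < 1.0 / s, -1.0, 0.0))

def project(term_weights, k=200, s=3):
    # Project a {term: weight} document into the k-dimensional space by
    # summing the weighted term vectors.
    doc = np.zeros(k)
    for term, weight in term_weights.items():
        doc += weight * term_vector(term, k, s)
    return doc

For example, project({"lsh": 2.0, "hashing": 1.0}) yields a dense vector whose
coordinates are roughly zero-mean, so the projected documents are no longer
confined to the positive orthant and stock random-hyperplane hashes split
them about evenly.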

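For comparison, here is a sketch of the two-point construction from the quoted
message, again with illustrative names (docs is assumed to be a NumPy array of
document vectors, one per row):

import numpy as np

def make_normals(docs, n_bits, seed=42):
    # Each hyperplane normal is the difference of two distinct documents
    # drawn at random without replacement.  Such a plane passes between
    # points of the data set, so it splits the collection roughly in half
    # even when all vectors live in the positive orthant, where a
    # uniformly random normal would almost never separate anything.
    rng = np.random.default_rng(seed)
    normals = []
    for _ in range(n_bits):
        i, j = rng.choice(len(docs), size=2, replace=False)
        normals.append(docs[i] - docs[j])
    return np.array(normals)

def lsh_key(doc, normals):
    # One bit per hyperplane: set when the document falls on the positive
    # side of that plane.  Documents hashing to the same key land in the
    # same LSH bucket.
    return "".join("1" if b else "0" for b in (normals @ doc > 0))

Because each normal runs between two actual data points, each bit comes out
positive for roughly half of the documents, which is the 50% property the
quoted message mentions.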