Yes. That would be a random projection with zero mean. You could map to a binary space as you suggest or use a continuous random projection. Since you are likely mapping to a lower-dimensional space to avoid disastrous expansion of the problem, I would be tempted to use the continuous projection to preserve information leading into the LSH.
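To make that concrete, here is a rough sketch (plain numpy, not Mahout code; the function name and parameters are just illustrative) of the kind of continuous projection I mean:

    import numpy as np

    def random_projection(docs, k, seed=0):
        """docs: (n_docs, n_terms) array of document vectors; returns (n_docs, k)."""
        rng = np.random.default_rng(seed)
        n_terms = docs.shape[1]
        # A dense Gaussian projection has zero-mean coordinates and roughly
        # preserves pairwise distances (Johnson-Lindenstrauss), so less
        # information is thrown away before the LSH step than with a binary map.
        R = rng.standard_normal((n_terms, k)) / np.sqrt(k)
        return docs @ R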
It would also be interesting to do one round of cooccurrence training a la semantic indexing. That would make the LSH vectors a bit more semantic.

On Mon, Apr 25, 2011 at 10:38 PM, Randall McRee <[email protected]> wrote:

> Ted,
> Seems like this is not a problem if you choose to map docs into an LSI-like
> vector space, namely instead of assigning each term its own dimension,
> assign a term to a sparse vector chosen from {0, 1, -1} randomly (0 is most
> probable). Problem solved, I think?
>
> Randy
>
> On Mon, Apr 25, 2011 at 3:11 PM, Ted Dunning <[email protected]> wrote:
>
> > Btw... LSH came up recently (thanks Lance!).
> >
> > One wrinkle that should be mentioned that might catch somebody
> > implementing this unawares is that documents in a vector space model
> > have highly non-random distributions that make the default formulation
> > of LSH very bad.
> >
> > The problem is that document vectors are normally confined to the
> > positive orthant. That means that a random hyper-plane has a very low
> > chance of splitting any two documents, and thus picking random vectors
> > as normals is a really bad way to get hash functions.
> >
> > This problem can be solved easily enough by picking separating planes
> > by picking two points at random without replacement and using their
> > difference as the normal vector for the separating plane. This can be
> > shown to give a hashing function that has the requisite 50% probability
> > of being positive for any document.
> >
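For reference, a minimal sketch (again plain numpy rather than Mahout code; names are illustrative) of the two-point trick described in the quoted message above:

    import numpy as np

    def lsh_signatures(docs, n_bits, seed=0):
        """docs: (n_docs, dim) array; returns (n_docs,) integer hash signatures."""
        rng = np.random.default_rng(seed)
        n_docs = docs.shape[0]
        sigs = np.zeros(n_docs, dtype=np.int64)
        for b in range(n_bits):
            # Normal vector = difference of two documents drawn without
            # replacement, rather than a random direction, so the plane
            # actually splits the corpus even in the positive orthant.
            i, j = rng.choice(n_docs, size=2, replace=False)
            normal = docs[i] - docs[j]
            bit = (docs @ normal > 0).astype(np.int64)
            sigs |= bit << b
        return sigs

By symmetry of the pair (i, j), each bit is positive with probability 1/2 for any document, which is exactly the property the default random-normal formulation loses when all vectors live in the positive orthant.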
