I thought that for a moment as well, but now I think not. The trivial counter example consists of one element for each single component unit vector e_i. The SVD of this has unit singular values and the eigenvectors are just the e_i themselves. After transformation, all values are still in the positive orthant.
I may have misunderstood what you are suggesting, but I don't think that LSH could distribute the values into other orthants. It may be that in practice that SVD will scatter data into many orthants, but I suspect it will not spread the data as widely as LSH assumptions would like. On the other hand, the two point hack that I mentioned guarantees that it meets the LSH assumptions so it isn't a failure of LSH, just that a random selection of hash functions is sub-optimal for data consisting of only positive values. On Mon, Apr 25, 2011 at 3:45 PM, Jake Mannix <[email protected]> wrote: > If you run LSH over the SVD-projections of your vectors, on the other hand, > you have not only distributed them in all orthants, but made the > distribution mostly uniform on the large scale, on account of the singular > value rescaling. > > On Apr 25, 2011 3:12 PM, "Ted Dunning" <[email protected]> wrote: > > Btw... LSH came up recently (thanks Lance!). > > One wrinkle that should be mentioned that might catch somebody implementing > this unawares is > that documents in a vector space model have highly non-random distributions > that make the default > formulation of LSH very bad. > > The problem is that document vectors are normally confined to the positive > orthant. That means that > a random hyper-plane has a very low chance of splitting any to documents > and > thus picking random > vectors as normals is a really bad way to get hash functions. > > This problem can be solved easily enough by picking separating planes by > picking two points at random > without replacement and using their difference as the normal vector for the > separating plane. This > can be shown to give a hashing funcction that has the requisite 50% > probability of being positive for > any document. >
