Re: LSH notes on text documents

Ted Dunning Mon, 25 Apr 2011 16:09:05 -0700

I thought that for a moment as well, but now I think not.

The trivial counter example consists of one element for each single
component unit vector e_i.  The SVD of this has unit singular values and the
eigenvectors are just the e_i themselves.  After transformation, all values
are still in the positive orthant.

I may have misunderstood what you are suggesting, but I don't think that LSH
could distribute the values into other orthants.

It may be that in practice that SVD will scatter data into many orthants,
but I suspect it will not spread the data as widely as LSH assumptions would
like.  On the other hand, the two point hack that I mentioned guarantees
that it meets the LSH assumptions so it isn't a failure of LSH, just that a
random selection of hash functions is sub-optimal for data consisting of
only positive values.

On Mon, Apr 25, 2011 at 3:45 PM, Jake Mannix <[email protected]> wrote:

> If you run LSH over the SVD-projections of your vectors, on the other hand,
> you have not only distributed them in all orthants, but made the
> distribution mostly uniform on the large scale, on account of the singular
> value rescaling.
>
> On Apr 25, 2011 3:12 PM, "Ted Dunning" <[email protected]> wrote:
>
> Btw... LSH came up recently (thanks Lance!).
>
> One wrinkle that should be mentioned that might catch somebody implementing
> this unawares is
> that documents in a vector space model have highly non-random distributions
> that make the default
> formulation of LSH very bad.
>
> The problem is that document vectors are normally confined to the positive
> orthant.  That means that
> a random hyper-plane has a very low chance of splitting any to documents
> and
> thus picking random
> vectors as normals is a really bad way to get hash functions.
>
> This problem can be solved easily enough by picking separating planes by
> picking two points at random
> without replacement and using their difference as the normal vector for the
> separating plane.  This
> can be shown to give a hashing funcction that has the requisite 50%
> probability of being positive for
> any document.
>

Re: LSH notes on text documents

Reply via email to