Ted,
Seems like this is not a problem if you map docs into an LSI-like vector
space first: instead of giving each term its own dimension, assign each term
a sparse random index vector with entries drawn from {0, 1, -1} (0 being by
far the most probable). Problem solved, I think?
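
A rough numpy sketch of what I mean (the dimensionality and sparsity here
are made-up illustrative parameters, not recommendations):

import numpy as np

def index_vector(dim=1000, nonzeros=10, rng=None):
    # Sparse random index vector with entries in {-1, 0, +1};
    # most entries stay 0.
    rng = rng or np.random.default_rng()
    v = np.zeros(dim)
    slots = rng.choice(dim, size=nonzeros, replace=False)
    v[slots] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

def embed(term_weights, term_index_vectors):
    # Document vector = weighted sum of its terms' index vectors.
    # Because index vectors mix +1 and -1 entries, documents end up
    # spread around the origin instead of stuck in the positive orthant.
    return sum(w * term_index_vectors[t] for t, w in term_weights.items())

# e.g.
# ivs = {t: index_vector() for t in ["mahout", "lsh", "vector"]}
# doc = embed({"mahout": 2.0, "lsh": 1.0}, ivs)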

Randy

On Mon, Apr 25, 2011 at 3:11 PM, Ted Dunning <[email protected]> wrote:

> Btw... LSH came up recently (thanks Lance!).
>
> One wrinkle that might catch somebody implementing this unawares is that
> documents in a vector space model have highly non-random distributions,
> which make the default formulation of LSH perform very badly.
>
> The problem is that document vectors are normally confined to the positive
> orthant.  That means that a random hyperplane has a very low chance of
> splitting any two documents, and thus picking random vectors as normals is
> a really bad way to get hash functions.
>
> This problem can be solved easily enough by choosing separating planes as
> follows: pick two documents at random without replacement and use their
> difference as the normal vector for the separating plane.  This can be
> shown to give a hash function that has the requisite 50% probability of
> being positive for any document.
>
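
For reference, a quick numpy sketch of the two-point construction Ted
describes (function and parameter names here are mine, purely
illustrative):

import numpy as np

def lsh_normals(docs, n_planes, rng=None):
    # Each plane's normal is the difference of two documents sampled
    # without replacement, per the construction above.
    rng = rng or np.random.default_rng()
    normals = []
    for _ in range(n_planes):
        i, j = rng.choice(len(docs), size=2, replace=False)
        normals.append(docs[i] - docs[j])
    return np.array(normals)

def hash_bits(doc, normals):
    # One bit per plane: which side of the plane the doc falls on.
    # With difference normals each bit is positive for roughly half
    # the documents, even though the docs sit in the positive orthant.
    return normals @ doc > 0

# Quick check with synthetic positive-orthant "documents":
# docs = np.abs(np.random.default_rng(0).standard_normal((100, 50)))
# planes = lsh_normals(docs, 16)
# print((docs @ planes.T > 0).mean(axis=0))  # fractions near 0.5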
