Mahout doesn't have LSH, does it? I thought I saw an issue, but my JIRA
query comes back empty...

On Mon, Apr 25, 2011 at 3:11 PM, Ted Dunning <[email protected]> wrote:
> Btw... LSH came up recently (thanks Lance!).
>
> One wrinkle worth mentioning, because it might catch somebody implementing
> this unawares, is that documents in a vector space model have highly
> non-random distributions, which makes the default formulation of LSH
> perform very badly.
>
> The problem is that document vectors are normally confined to the positive
> orthant.  That means that a random hyperplane has a very low chance of
> splitting any two documents, and thus picking random vectors as normals
> is a really bad way to get hash functions.
>
> This problem can be solved easily enough by picking two points at random
> without replacement and using their difference as the normal vector for
> the separating plane.  This can be shown to give a hashing function that
> has the requisite 50% probability of being positive for any document.
>
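
For anyone who wants to try this, here is a rough sketch of the two ways of
picking a normal described above. It is not Mahout code. The helper names
(hash_bit, gaussian_normal, difference_normal, positive_fraction) are made up
for illustration, and I am assuming the plane passes through the origin, which
the message above does not actually specify.

```python
import random


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def hash_bit(normal, doc):
    """One hash bit: 1 if doc is on the positive side of the plane, else 0.

    Assumption: the plane passes through the origin.
    """
    return 1 if dot(normal, doc) > 0 else 0


def gaussian_normal(dim, rng):
    """Default formulation: a normal drawn at random, ignoring the data."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def difference_normal(docs, rng):
    """Suggested fix: the difference of two documents sampled without replacement."""
    a, b = rng.sample(docs, 2)
    return [x - y for x, y in zip(a, b)]


def positive_fraction(normal, docs):
    """Fraction of docs that get bit 1; a useful hash bit sits well away from 0 and 1."""
    return sum(hash_bit(normal, d) for d in docs) / len(docs)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 20
    # Toy stand-in for term-weight vectors: sparse and non-negative.
    docs = []
    for _ in range(200):
        d = [0.0] * dim
        for i in rng.sample(range(dim), 5):
            d[i] = rng.random()
        docs.append(d)

    print("random normal:    ", positive_fraction(gaussian_normal(dim, rng), docs))
    print("difference normal:", positive_fraction(difference_normal(docs, rng), docs))
```

Running positive_fraction over several draws of each kind of normal on your
own corpus should show how evenly each one splits the documents.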
