Mahout doesn't have LSH, does it? I thought I saw an issue, but my JIRA query comes back empty...
On Mon, Apr 25, 2011 at 3:11 PM, Ted Dunning <[email protected]> wrote:
> Btw... LSH came up recently (thanks Lance!).
>
> One wrinkle that should be mentioned, and that might catch somebody
> implementing this unawares, is that documents in a vector space model
> have highly non-random distributions that make the default formulation
> of LSH very bad.
>
> The problem is that document vectors are normally confined to the
> positive orthant. That means that a random hyperplane has a very low
> chance of splitting any two documents, and thus picking random vectors
> as normals is a really bad way to get hash functions.
>
> This problem can be solved easily enough by choosing each separating
> plane from two points picked at random without replacement and using
> their difference as the normal vector for the plane. This can be shown
> to give a hashing function that has the requisite 50% probability of
> being positive for any document.
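
For anyone who wants to try the idea quoted above, here is a minimal sketch in plain Python. It is not Mahout code. The function names, the synthetic data, and the sizes are all made up for illustration. It assumes unit-length document vectors and planes through the origin, which the quoted message does not spell out. Under those assumptions a difference normal always separates the two points it was built from, and swapping the two points flips the sign, which is where the 50% figure comes from. The script prints the observed split rate for each scheme so they can be compared.

```python
import math
import random


def normalize(v):
    """Scale v to unit Euclidean length (v must be non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gaussian_normal(dim, rng):
    """Default LSH formulation: a normal with a uniformly random direction."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def difference_normal(docs, rng):
    """Normal built from two distinct documents drawn without replacement."""
    a, b = rng.sample(docs, 2)
    return [x - y for x, y in zip(a, b)]


def hash_bit(normal, doc):
    """1 if doc lies on the positive side of the plane through the origin."""
    return 1 if dot(normal, doc) > 0 else 0


def split_rate(normals, docs, rng, trials=2000):
    """Fraction of (plane, document pair) draws where the two bits differ."""
    splits = 0
    for _ in range(trials):
        n = rng.choice(normals)
        a, b = rng.sample(docs, 2)
        splits += hash_bit(n, a) != hash_bit(n, b)
    return splits / trials


if __name__ == "__main__":
    rng = random.Random(42)
    dim, n_docs, n_planes = 50, 200, 64
    # Synthetic stand-in for term-weight vectors: dense, non-negative entries,
    # so every document sits in the positive orthant.
    docs = [normalize([rng.random() for _ in range(dim)]) for _ in range(n_docs)]
    gauss = [gaussian_normal(dim, rng) for _ in range(n_planes)]
    diff = [difference_normal(docs, rng) for _ in range(n_planes)]
    print("split rate, random normals:    ", split_rate(gauss, docs, rng))
    print("split rate, difference normals:", split_rate(diff, docs, rng))
```

A random hyperplane splits two vectors with probability equal to their angle divided by pi, so documents packed into the positive orthant can never be split more than half the time by random normals, and similar documents far less often. Real term vectors are sparse rather than dense, so the numbers from this toy data are only indicative.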
