2010/6/9 Olivier Grisel <[email protected]>:
> 2010/6/9 Jake Mannix <[email protected]>:
>> On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel
>> <[email protected]> wrote:
>>
>>> 2010/6/8 Ted Dunning <[email protected]>:
>>> > Got it.
>>> >
>>> > This really needs to be done before vectorization, but you can
>>> > segregate the output vector for different handling by passing in a
>>> > view to different parts of the vector.
>>> >
>>> > My recommendation is that you apply IDF using the weight dictionary
>>> > in the vectorizer. That will let you have multiple text fields with
>>> > different weighting schemes but still put all the results into a
>>> > single result vector. As a side effect, if you put everything into a
>>> > vector of dimension 1, then you get multi-field weighted inputs for
>>> > free.
>>>
>>> Instead of storing the exact IDF values in an explicit dictionary,
>>> one could use a counting Bloom filter data structure to reduce the
>>> memory footprint and speed up the lookups (though Lucene is able to
>>> handle millions of terms without any perf issues).
>>>
>>
>> Using counting bloom filters is a really good idea here. Do you know
>> any good java implementations of these?
>
> Nope, but AFAIK Ted's combination of probes logic + Murmurhash
> implementation does 90% of the work.

Actually, according to this very interesting blog post by Jonathan Ellis,
both Hadoop and Cassandra provide tested and efficient counting filters
based on MurmurHash:

http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
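
PS: to make the idea in this thread concrete, here is a rough, self-contained sketch of a counting Bloom filter used as an approximate document-frequency store. It is not the Hadoop, Cassandra, or Mahout code. The class name `CountingBloomSketch`, the `idf` helper, and its smoothing formula are made up for illustration. FNV-1a stands in for MurmurHash to keep the example short, and plain `int` counters are used where a real implementation would likely pack narrower ones to save memory.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative counting Bloom filter: k probes into an array of counters. */
final class CountingBloomSketch {
    private final int[] counters;
    private final int numProbes;

    CountingBloomSketch(int numCounters, int numProbes) {
        if (numCounters <= 0 || numProbes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.counters = new int[numCounters];
        this.numProbes = numProbes;
    }

    // 32-bit FNV-1a with a seed mixed in; a stand-in for MurmurHash.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th probe is h1 + i * h2, reduced to a valid index.
    private int probe(byte[] data, int i) {
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x9747b28c) | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    /** Record one more occurrence of the term (e.g. one more document). */
    void add(String term) {
        byte[] data = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numProbes; i++) {
            counters[probe(data, i)]++;
        }
    }

    /** Undo one earlier add(term); counters never go below zero. */
    void remove(String term) {
        byte[] data = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numProbes; i++) {
            int idx = probe(data, i);
            if (counters[idx] > 0) {
                counters[idx]--;
            }
        }
    }

    /** Approximate count: the minimum counter over all probes (may overestimate). */
    int count(String term) {
        byte[] data = term.getBytes(StandardCharsets.UTF_8);
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < numProbes; i++) {
            min = Math.min(min, counters[probe(data, i)]);
        }
        return min;
    }

    /** Hypothetical smoothed IDF: log(numDocs / (1 + approximate document frequency)). */
    double idf(String term, long numDocs) {
        return Math.log((double) numDocs / (1 + count(term)));
    }
}
```

Calling `add(term)` once per document that contains the term makes `count(term)` an upper bound on the document frequency, so the derived IDF can only be underestimated when probes collide, never overestimated.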
