On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel <[email protected]> wrote:

> 2010/6/8 Ted Dunning <[email protected]>:
> > Got it.
> >
> > This really needs to be done before vectorization, but you can segregate the
> > output vector for different handling by passing in a view to different parts
> > of the vector.
> >
> > My recommendation is that you apply IDF using the weight dictionary in the
> > vectorizer. That will let you have multiple text fields with different
> > weighting schemes but still put all the results into a single result vector.
> > As a side effect, if you put everything into a vector of dimension 1, then
> > you get multi-field weighted inputs for free.
>
> Instead of storing the exact IDF values in an explicit dictionary,
> one could use a counting Bloom filter data structure to reduce the
> memory footprint and speed up the lookups (though Lucene is able to
> handle millions of terms without any perf issues).
>

Using counting Bloom filters is a really good idea here. Do you know any
good Java implementations of these?

  -jake
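
To make the idea above concrete, here is a minimal sketch of a counting Bloom
filter used for approximate document-frequency lookups. It is not taken from
Mahout or Lucene: the class name CountingBloomFilter, its methods, the FNV-1a
based double hashing, and the smoothed IDF formula log(N / (1 + df)) are all
assumptions chosen for illustration. The estimate is the minimum of the k
counters a term maps to, so it can overestimate the true count but never
underestimate it (as long as only previously added terms are removed).

```java
import java.nio.charset.StandardCharsets;

/** Illustrative counting Bloom filter; names and hashing scheme are hypothetical. */
public class CountingBloomFilter {
    private final int[] counters;
    private final int numHashes;

    public CountingBloomFilter(int numCounters, int numHashes) {
        if (numCounters <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.counters = new int[numCounters];
        this.numHashes = numHashes;
    }

    // Seeded 32-bit FNV-1a hash (an assumed choice; any decent hash would do).
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th index is derived from two base hashes.
    private int index(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    /** Record one more document containing the term. */
    public void add(String term) {
        byte[] data = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            counters[index(data, i)]++;
        }
    }

    /** Undo one add; only call this for terms that were added before. */
    public void remove(String term) {
        byte[] data = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            int idx = index(data, i);
            if (counters[idx] > 0) {
                counters[idx]--;
            }
        }
    }

    /** Upper bound on how many times the term was added (minimum counter). */
    public int estimate(String term) {
        byte[] data = term.getBytes(StandardCharsets.UTF_8);
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < numHashes; i++) {
            min = Math.min(min, counters[index(data, i)]);
        }
        return min;
    }

    /** Approximate smoothed IDF; the exact formula is an assumption. */
    public double idf(String term, long numDocs) {
        return Math.log((double) numDocs / (1 + estimate(term)));
    }
}
```

Calling add(term) once per document containing the term, and then
idf(term, numDocs) at vectorization time, replaces the explicit dictionary
lookup. Memory is fixed by numCounters regardless of vocabulary size; the
price is that collisions inflate some counts and so lower some IDF values.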
