2010/6/8 Ted Dunning <[email protected]>:
> Got it.
>
> This really needs to be done before vectorization, but you can segregate the
> output vector for different handling by passing in a view to different parts
> of the vector.
>
> My recommendation is that you apply IDF using the weight dictionary in the
> vectorizer. That will let you have multiple text fields with different
> weighting schemes but still put all the results into a single result vector.
> As a side effect, if you put everything into a vector of dimension 1, then
> you get multi-field weighted inputs for free.
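
To make the quoted recommendation concrete, here is a minimal sketch in plain Python rather than Mahout's actual API. It assumes one weight dictionary per text field and hashes every field into the same result vector. The names `idf_weights` and `encode_field`, the CRC32 hashing, the smoothed IDF formula and the default weight for unseen terms are all assumptions made for illustration.

```python
import math
import zlib


def idf_weights(docs):
    """Build a term -> IDF dictionary from a list of tokenized documents.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is an
    assumption of this sketch and not necessarily the formula Mahout uses.
    """
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def encode_field(tokens, weights, out, field, default=1.0):
    """Add the weighted tokens of one field into the shared vector `out`.

    The field name is part of the hashed key, so the same word in two
    fields usually lands in different slots. Terms missing from the
    weight dictionary get `default` (a hypothetical choice).
    """
    size = len(out)
    for term in tokens:
        key = f"{field}:{term}".encode("utf-8")
        out[zlib.crc32(key) % size] += weights.get(term, default)


# Two fields, two weighting schemes, one result vector.
title_idf = idf_weights([["mahout", "vectors"], ["mahout", "idf"]])
body_idf = idf_weights([["apply", "idf", "before", "hashing"], ["apply", "views"]])

vector = [0.0] * 16
encode_field(["mahout", "idf"], title_idf, vector, field="title")
encode_field(["apply", "idf", "views"], body_idf, vector, field="body")
```

With `vector = [0.0]` (dimension 1) every term hashes to the same slot, so the single value is simply the sum of the per-field weights.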
Instead of storing the exact IDF values in an explicit dictionary, one could use a counting Bloom filter to reduce the memory footprint and speed up lookups (though Lucene can handle millions of terms without any performance issues).

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
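
A rough sketch of the counting Bloom filter idea, in plain Python and not tied to Mahout or Lucene. It assumes document frequencies are kept in a fixed array of counters and read back as the minimum over a term's counters, so the estimate can overcount on collisions but never undercounts. The class `CountingBloomFilter`, the helper `approx_idf`, the SHA-256 double hashing and the smoothed IDF formula are all assumptions made for illustration.

```python
import hashlib
import math


class CountingBloomFilter:
    """Fixed-size array of counters addressed by several hashes per term."""

    def __init__(self, num_counters=1 << 16, num_hashes=4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, term):
        # Derive num_hashes positions from one digest (double hashing).
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        m = len(self.counters)
        # A set avoids incrementing the same counter twice for one term.
        return {(h1 + i * h2) % m for i in range(self.num_hashes)}

    def add(self, term):
        """Record one more document containing `term`."""
        for i in self._indexes(term):
            self.counters[i] += 1

    def estimate(self, term):
        """Upper bound on the number of times `term` was added."""
        return min(self.counters[i] for i in self._indexes(term))


def approx_idf(cbf, term, n_docs):
    """Smoothed IDF from the estimated document frequency.

    The estimate is clamped to n_docs because collisions can push a
    counter above the number of documents.
    """
    df = min(cbf.estimate(term), n_docs)
    return math.log((1 + n_docs) / (1 + df)) + 1.0


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "the"]]
cbf = CountingBloomFilter(num_counters=1024, num_hashes=3)
for tokens in docs:
    for term in set(tokens):  # document frequency: once per document
        cbf.add(term)

idf_the = approx_idf(cbf, "the", len(docs))
idf_dog = approx_idf(cbf, "dog", len(docs))
```

Memory use is fixed by `num_counters` regardless of vocabulary size. The trade-off is that collisions inflate the frequency estimate, so rare terms can receive a lower IDF than an exact dictionary would give them.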
