On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel <[email protected]> wrote:

> 2010/6/8 Ted Dunning <[email protected]>:
> > Got it.
> >
> > This really needs to be done before vectorization, but you can segregate
> > the output vector for different handling by passing in a view to
> > different parts of the vector.
> >
> > My recommendation is that you apply IDF using the weight dictionary in
> > the vectorizer.  That will let you have multiple text fields with
> > different weighting schemes but still put all the results into a single
> > result vector.  As a side effect, if you put everything into a vector of
> > dimension 1, then you get multi-field weighted inputs for free.
>
> Instead of storing the exact IDF values in an explicit dictionary,
> one could use a counting Bloom filter data structure to reduce the
> memory footprint and speed up the lookups (though Lucene is able to
> handle millions of terms without any perf issues).
>

Using counting Bloom filters is a really good idea here.  Do you know
of any good Java implementations of these?
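
To make the idea concrete, here is a rough sketch of what I have in mind.
It is not taken from Mahout or Lucene: the class name, the method names, and
the choice of hash functions (String.hashCode plus CRC32, combined by double
hashing) are all assumptions made for illustration. The document frequency
is estimated as the minimum of the term's counters, so collisions can
inflate it but never shrink it as long as terms are only added. The idf()
helper uses log(N / (1 + df)), which is just one common smoothing variant.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrative counting Bloom filter; a hypothetical sketch, not a library API. */
final class CountingBloomFilter {
    private final int[] counters;
    private final int numHashes;

    CountingBloomFilter(int numCounters, int numHashes) {
        if (numCounters <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.counters = new int[numCounters];
        this.numHashes = numHashes;
    }

    /** Counter indexes for a term, derived by double hashing: (h1 + i * h2) mod m. */
    private int[] slots(String term) {
        int h1 = term.hashCode();
        CRC32 crc = new CRC32();
        crc.update(term.getBytes(StandardCharsets.UTF_8));
        int h2 = ((int) crc.getValue()) | 1; // force the step to be odd
        int[] result = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long combined = (long) h1 + (long) i * h2;
            result[i] = (int) Math.floorMod(combined, (long) counters.length);
        }
        return result;
    }

    /** Records one more document containing the term. */
    void add(String term) {
        for (int s : slots(term)) {
            counters[s]++;
        }
    }

    /** Undoes one add(); counters are never pushed below zero. */
    void remove(String term) {
        for (int s : slots(term)) {
            if (counters[s] > 0) {
                counters[s]--;
            }
        }
    }

    /** Estimated document frequency: the smallest counter among the term's slots. */
    int estimate(String term) {
        int min = Integer.MAX_VALUE;
        for (int s : slots(term)) {
            min = Math.min(min, counters[s]);
        }
        return min;
    }

    /** Approximate IDF from the estimated document frequency (one smoothing variant). */
    double idf(String term, int numDocs) {
        return Math.log((double) numDocs / (1 + estimate(term)));
    }

    public static void main(String[] args) {
        CountingBloomFilter df = new CountingBloomFilter(1 << 16, 3);
        df.add("lucene");
        df.add("lucene");
        df.add("mahout");
        System.out.println("df(lucene) ~ " + df.estimate("lucene"));
        System.out.println("idf(lucene) ~ " + df.idf("lucene", 1000));
    }
}
```

The memory cost is fixed by the number of counters rather than by the
vocabulary size, which is where the saving over an explicit dictionary would
come from. The price is that rare terms can look more frequent than they
are, so their IDF comes out slightly too low.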

  -jake
