Re: producing vectors from composite documents

Olivier Grisel Tue, 08 Jun 2010 16:11:42 -0700

2010/6/8 Ted Dunning <[email protected]>:
> Got it.
>
> This really needs to be done before vectorization, but you can segregate the
> output vector for different handling by passing in a view to different parts
> of the vector.
>
> My recommendation is that you apply IDF using the weight dictionary in the
> vectorizer.  That will let you have multiple text fields with different
> weighting schemes but still put all the results into a single result vector.
>  As a side effect, if you put everything into a vector of dimension 1, then
> you get multi-field weighted inputs for free.


Instead of storing the exact IDF values in an explicit dictionary,
one could use a counting Bloom filter data structure to reduce the
memory footprint and speed up the lookups (though Lucene is able to
handle millions of terms without any performance issues).
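
A minimal sketch of the idea, in Python for brevity. It is not Mahout
or Lucene code: the CountingBloomFilter class, the idf() helper, the
salted MD5 hashing and the smoothing formula are assumptions chosen for
illustration. The filter keeps approximate document frequencies in a
fixed-size array of counters instead of a term dictionary.

```python
import hashlib
import math
from array import array


class CountingBloomFilter:
    """Approximate term -> document-frequency counter (hypothetical)."""

    def __init__(self, num_slots=1 << 16, num_hashes=4):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # Fixed-size array of unsigned counters; no term is stored.
        self.counts = array("I", [0]) * num_slots

    def _slots(self, term):
        # Derive the slot indexes from salted MD5 digests of the term.
        # A set avoids incrementing the same slot twice for one add().
        slots = set()
        for seed in range(self.num_hashes):
            digest = hashlib.md5(f"{seed}:{term}".encode("utf-8")).digest()
            slots.add(int.from_bytes(digest[:8], "big") % self.num_slots)
        return slots

    def add(self, term):
        for s in self._slots(term):
            self.counts[s] += 1

    def estimate(self, term):
        # The smallest counter is the estimate: collisions can only
        # inflate it, so the true count is never under-estimated.
        return min(self.counts[s] for s in self._slots(term))


def idf(cbf, term, num_docs):
    # Smoothed IDF (one common variant); the +1 terms keep the ratio
    # finite for unseen terms.
    return math.log((1 + num_docs) / (1 + cbf.estimate(term))) + 1.0


# Toy corpus: count each distinct term once per document.
docs = [["the", "mahout"], ["the", "vector"], ["the", "lucene"]]
cbf = CountingBloomFilter()
for tokens in docs:
    for term in set(tokens):
        cbf.add(term)
```

The trade-off is that the values are no longer exact: a collision
inflates a document frequency and therefore lowers the resulting IDF,
and the terms themselves cannot be enumerated from the filter.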

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
