2010/6/9 Olivier Grisel <[email protected]>:
> 2010/6/9 Jake Mannix <[email protected]>:
>> On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel
>> <[email protected]> wrote:
>>
>>> 2010/6/8 Ted Dunning <[email protected]>:
>>> > Got it.
>>> >
>>> > This really needs to be done before vectorization, but you can segregate
>>> > the output vector for different handling by passing in a view to different
>>> > parts of the vector.
>>> >
>>> > My recommendation is that you apply IDF using the weight dictionary in the
>>> > vectorizer.  That will let you have multiple text fields with different
>>> > weighting schemes but still put all the results into a single result
>>> > vector.  As a side effect, if you put everything into a vector of
>>> > dimension 1, then you get multi-field weighted inputs for free.
>>>
>>> Instead of storing the exact IDF values in an explicit dictionary,
>>> one could use a counting Bloom filter data structure to reduce the
>>> memory footprint and speed up the lookups (though Lucene is able to
>>> handle millions of terms without any performance issues).
>>>
>>
>> Using counting bloom filters is a really good idea here.  Do you know
>> any good java implementations of these?
>
> Nope, but AFAIK Ted's combination of probes logic + Murmurhash
> implementation does 90% of the work.

Actually, according to this very interesting blog post by Jonathan
Ellis, both Hadoop and Cassandra provide tested and efficient counting
filters based on MurmurHash:

 http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
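
For illustration only, here is a minimal sketch of the idea in plain Java.
It is not the Hadoop or Cassandra code: the class name CountingBloomFilter,
its methods, and the idf() smoothing are hypothetical, and a seeded FNV-1a
hash stands in for MurmurHash to keep the example self-contained. Each term
is mapped to a few counters; the minimum over those counters is an upper
bound on the term's document frequency, so the derived IDF can only be
underestimated, never overestimated.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a counting Bloom filter used as an approximate
// document-frequency table. Not taken from Hadoop, Cassandra, or Mahout.
final class CountingBloomFilter {
    private final int[] counters;
    private final int numProbes;

    CountingBloomFilter(int numCounters, int numProbes) {
        if (numCounters <= 0 || numProbes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.counters = new int[numCounters];
        this.numProbes = numProbes;
    }

    // Seeded FNV-1a over the UTF-8 bytes (assumption: stand-in for MurmurHash).
    private static int hash(String term, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: probe i lands on h1 + i * h2, reduced to a valid index.
    private int slot(String term, int i) {
        int h1 = hash(term, 0);
        int h2 = hash(term, 0x9747b28c) | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    // Record one more document containing the term.
    void add(String term) {
        for (int i = 0; i < numProbes; i++) {
            counters[slot(term, i)]++;
        }
    }

    // Upper bound on how many times the term was added.
    int estimate(String term) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < numProbes; i++) {
            min = Math.min(min, counters[slot(term, i)]);
        }
        return min;
    }

    // One common smoothed IDF variant; the exact formula is an assumption.
    double idf(String term, int numDocs) {
        return Math.log((double) numDocs / (1 + estimate(term)));
    }
}
```

The memory cost is fixed by numCounters regardless of vocabulary size, at
the price of collisions inflating some counts.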

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
