SpanScorer handling of non-disjoint phrases

David Kaelbling Wed, 23 Apr 2008 13:16:29 -0700

Hi,

I've been using the 2.3.1 contrib highlighter with the 2/10/2008
SpanHighlighter patch, and have run into some trouble.  If I have two
phrases in a query that share a term (e.g. "hello world" and "hello
goodbye"), the SpanScorer does not seem to highlight 'hello' consistently.


It looks to me like WeightedSpanTermExtractor.extract() is clobbering
the span positions for 'hello' the second time it encounters the term.
Should terms.putAll(booleanTerms) and terms.putAll(disjunctTerms) really
be replacing the old entry, or should they try to addPositionSpans()?
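
To make the question concrete, here is a minimal sketch of the merge I
have in mind. It does not use the real highlighter classes: TermSpans
and mergeInto() are hypothetical stand-ins I made up for illustration,
and I'm assuming the extracted terms live in a Map keyed by term text.
A plain putAll() would drop the spans already recorded for 'hello';
the loop below appends them instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; these are NOT the contrib highlighter classes.
class SpanMergeSketch {

    /** A term plus every {start, end} position span recorded for it. */
    static final class TermSpans {
        final String term;
        final List<int[]> spans = new ArrayList<>();

        TermSpans(String term, int start, int end) {
            this.term = term;
            spans.add(new int[] {start, end});
        }
    }

    /**
     * Merge source into target. Unlike Map.putAll(), an entry that already
     * exists in target keeps its spans and gains the spans from source.
     */
    static void mergeInto(Map<String, TermSpans> target,
                          Map<String, TermSpans> source) {
        for (Map.Entry<String, TermSpans> e : source.entrySet()) {
            TermSpans existing = target.get(e.getKey());
            if (existing == null) {
                target.put(e.getKey(), e.getValue());
            } else {
                existing.spans.addAll(e.getValue().spans);
            }
        }
    }
}
```

With "hello world" extracted first and "hello goodbye" second, the
entry for 'hello' would end up holding both phrases' spans rather than
only the second one's.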

        Thanks,
        David

PS: While I'm asking, it looks like getWeightedSpanTermsWithScores()
wraps the cachingTokenFilter passed to it by SpanScorer.init() in
another CachingTokenFilter. Doesn't that duplicate the cache?

-- 
David Kaelbling
Senior Software Engineer
Black Duck Software, Inc.

[EMAIL PROTECTED]
T +1.781.810.2041
F +1.781.891.5145

http://www.blackducksoftware.com


