On Thu, Apr 30, 2009 at 5:16 AM, Michael McCandless < luc...@mikemccandless.com> wrote:
> On Thu, Apr 30, 2009 at 12:15 AM, Max Lynch <ihas...@gmail.com> wrote: > > You should switch to the SpanScorer (in o.a.l.search.highlighter). > >> That fragment scorer should only match true phrase matches. > >> > >> Mike > >> > > > > Thanks Mike. I gave it a try and it wasn't working how I expected. I am > > using pylucene right now so I can ask them if the implementation is > > different. I'm messing around with the lucene unit tests to see exactly > how > > the scorer should work. > > Can you give more details on what's not working right? Sorry, the following code is in python, but I can hack a Java thing together if necessary. HighlighterSpanScorer is the SpanScorer from the highlight package just renamed to avoid conflict with the other SpanScorer object. Well what happens is if I use a SpanScorer instead, and allocate it like such: analyzer = StandardAnalyzer([]) tokenStream = analyzer.tokenStream("contents", lucene.StringReader(text)) ctokenStream = lucene.CachingTokenFilter(tokenStream) highlighter = lucene.Highlighter(formatter, lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream)) ctokenStream.reset() result = highlighter.getBestFragments(ctokenStream, text, 2, "...") My highlighter is still breaking up words inside of a span. For example, if I search for \"John Smith\", instead of the highlighter being called for the whole "John Smith", it gets called for "John" and then "Smith". > > > In the mean time, If I am interested in finding out exactly how many > times a > > term was found in a document, what is the best way to go about this? The > > way I am doing it right now is using a highlighter and just incrementing > > counters when a word is found that I'm interested. I just came across > > FieldSortedTermVectorMapper that could do something similar. Is > > FieldSortedTermVectorMapper something I could use for this? Is there a > > better option? > > > Is it really just single terms you need to measure? (eg, not "how > many times did phrase XYZ occur in the doc"). If so, then getting the > term vectors and locating your term in there, should work. This is > probably OK if you just do it for each of the hits on the page (like > 10 hits), but will be way too slow if you try to do it for say all > docs that matched the query. > I see how the term vector might be used. I can't really tell if there is a way for me to do a Span check on the words as easily as the highlighter would do. -max