Re: amusing interaction between advanced tokenizers and highlighter package

Erik Hatcher Sat, 19 Jun 2004 02:16:44 -0700

On Jun 19, 2004, at 2:29 AM, David Spencer wrote:

A naive analyzer would turn something like "SyncThreadPool" into one token. Mine uses the great Lucene capability of Tokens being able to have a "0" position increment to turn it into the token stream:
Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so "SyncThread" and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably would want to see a match for "SyncThreadPool" even this is the evil leading-prefix case. With most other Analyzers and ways of forming a query this would be missed, which I think is anti-human and annoys me to no end.

There are indexing/querying solutions/workarounds to the leading-prefix issue, such as reversing the text as you index it and ensuring you do the same on queries so they match. There are some interesting techniques for this type of thing in the Managing Gigabytes book I'm currently reading, which Lucene could support with custom analysis and queries, I believe.

The problem is as follows. In all cases I use my Analyzer to index the documents. If I use my Analyzer with with the Highligher package, it doesn't look at the position increment of the tokens and consequently a nonsense stream of matches is output. If I use a different Analyzer w/ the highlighter (say, the StandardAnalyzer), then it doesn't show the matches that really matched, as it doesn't see the "subtokens".

Are your "subtokens" marked with correct offset values? This probably doesn't relate to the problem you're seeing, but I'm curious.

It might be the fix is for the Highlighter to look at the position increment of tokens and only pass by one if multiple ones have an incr of 0 and match one part of the query.

Has this come up before and is the issue clear?

The problem is clear, and I've identified this issue with my exploration of the Highlighter also. The Highlighter works well for the most common scenarios, but certainly doesn't cover all the bases. The majority of scenarios do not use multiple tokens in a single position. Also, it also doesn't currently handle the new SpanQuery family - although Highlighting spans would be quite cool. After learning how Highlighter works, I have a deep appreciation for the great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, although I haven't thought it through completely. Perhaps Mark can comment. If you do modify it to work for your case, it would be great to have your contribution rolled back in :)

        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: amusing interaction between advanced tokenizers and highlighter package

Reply via email to