[jira] Commented: (LUCENE-794) Beginnings of a span based highlighter

Mark Miller (JIRA) Sun, 04 Feb 2007 15:37:27 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470098
 ]


Mark Miller commented on LUCENE-794:
------------------------------------

Sorry about all that, Mark H. This was literally just some test code that I 
quickly shoved into an API similar to your existing highlighter. If you decide 
that it should be considered on its own, I would certainly have quite a bit 
further to go. Mostly I just put it up for your evaluation of extending the 
current highlighter with this highlight method.

>1) Fieldname "contents" shouldn't be hardcoded into the Highlighter - 
>different analyzers can behave differently for different fields (see 
>PerFieldAnalyzerWrapper). Either pass a fieldname parameter or do as the 
>existing highlighter does and take a TokenStream. The latter approach has the 
>advantage of being able to avoid re-analysis and make use of any stored 
>TermVectors (see TokenSources.java)

I don't have a great solution for this right now. I need to read the 
TokenStream at least twice because the MemoryIndex consumes it when extracting 
the spans. Unfortunately, it seems I can either copy the tokens to a list or 
pass them to the MemoryIndex -- I cannot do both. The MemoryIndex also needs a 
field name...so while I changed the API to take a TokenStream, I have not 
resolved the need for the field name as well. I am hoping you have some good 
comments. To get around reading the TokenStream twice I used the horribly hacky 
but quick-for-me approach of adding a method to MemoryIndex that accepts a List 
of Tokens. Any ideas?
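
To make the "read it twice" problem concrete, here is a minimal sketch of 
buffering a one-shot stream so it can be replayed. OneShotSource, ArraySource 
and TokenBuffer are hypothetical names invented for illustration; they are not 
Lucene classes and not the code attached to this issue, and plain strings stand 
in for Tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a stream that can only be read once.
// It is NOT Lucene's TokenStream; it only models "call next() until null".
interface OneShotSource {
    String next();
}

// Simple one-shot source over a fixed set of terms, used for illustration.
class ArraySource implements OneShotSource {
    private final Iterator<String> it;

    ArraySource(String... terms) {
        it = Arrays.asList(terms).iterator();
    }

    public String next() {
        return it.hasNext() ? it.next() : null;
    }
}

// Drains the source exactly once, then hands out as many passes as needed.
class TokenBuffer {
    private final List<String> tokens = new ArrayList<String>();

    TokenBuffer(OneShotSource source) {
        for (String t = source.next(); t != null; t = source.next()) {
            tokens.add(t);
        }
    }

    // Each call starts a fresh pass over the same buffered tokens.
    Iterator<String> replay() {
        return tokens.iterator();
    }

    int size() {
        return tokens.size();
    }
}
```

Under these assumptions, one replay() pass could feed the span extraction and a 
second could drive the highlighting loop, without the caller having to supply 
the stream twice.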

>2) Analyzers which produce overlapping tokens (see Synonym analyzer in existing 
>highlighter JUnit test) are problematic in the existing code. I remember the 
>"TokenGroup" class in the existing highlighter was an approach to help cater 
>for these "overlap" scenarios.

I always attack this last <G>. Seems a simple fix: if a token's position 
increment equals 0, skip printing it out. It passes your test, which I have 
added to my test code, but I am not totally confident it is perfect yet.

>3) Without wishing to resurrect the whole 1.4 vs 1.5 debate I believe Lucene 
>still targets Java 1.4.

Just me being lazy. I swear I have seen Contrib stuff that says 1.5. I have 
gone through and stripped out all of the 1.5 features except for StringBuilder 
for the moment.

>To rectify these points it's not clear to me if it would be quicker to use 
>your code or adapt the existing highlighter code to use spans.
>Thoughts? 

Depends entirely on what you think. I am sure I can fix all of the issues you 
mention (with a little advice <G>), but I am pretty new to this type of thing 
and perhaps you just want to start from scratch in order to achieve span 
highlighting with the existing highlighter. It may just be that the way I am 
doing this is not very compatible with the way you currently fragment and score.

I have added an updated Highlighter.java and HighlighterTest.java. The 
MemoryIndex problem remains...so it either has to be fixed or the modified 
MemoryIndex must be used.

- Mark m

> Beginnings of a span based highlighter
> --------------------------------------
>
>                 Key: LUCENE-794
>                 URL: https://issues.apache.org/jira/browse/LUCENE-794
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, HighlighterTest.java, 
> HighlighterTest.java, MemoryIndex.java, QuerySpansExtractor.java, 
> SimpleFormatter.java
>
>
> This is some test code to start the work of adding a span based highlighting 
> approach to the existing highlighter in contrib. See 
> http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

