[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470098 ]
Mark Miller commented on LUCENE-794:
------------------------------------

Sorry about all that, Mark H. This was literally just some test code that I quickly shoved into an API similar to your existing highlighter. If you decide that it should be considered on its own, I would certainly have quite a bit further to go. Mostly I just put it up for your evaluation on extending the current highlighter with this highlight method.

> 1) Fieldname "contents" shouldn't be hardcoded into the Highlighter -
> different analyzers can behave differently for different fields (see
> PerFieldAnalyzerWrapper). Either pass a fieldname parameter or do as the
> existing highlighter does and take a TokenStream. The latter approach has
> the advantage of being able to avoid re-analysis and make use of any stored
> TermVectors (see TokenSources.java)

I don't have a great solution for this right now. I need to read the TokenStream at least twice because the MemoryIndex consumes it when extracting the spans. Unfortunately, it seems I can either copy the tokens to a list or pass them to the MemoryIndex, but not both. The MemoryIndex also expects a field name, so while I changed the API to take a TokenStream, I have not resolved the need for the field name. I am hoping you have some good comments. To get around reading the TokenStream twice, I used the horribly hacky but quick-for-me approach of adding a method to MemoryIndex that accepts a List of Tokens. Any ideas?

> 2) Analyzers which produce overlapping tokens (see Synonym analyzer in
> existing highlighter Junit test) are problematic in the existing code. I
> remember the "TokenGroup" class in the existing highlighter was an approach
> to help cater for these "overlap" scenarios.

I always attack this last <G>. It seems a simple fix: if the position increment equals 0, skip printing out the token. It passes your test, which I have added to my test code, but I am not totally confident it is perfect yet.
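
As an illustration of the position-increment check described for point 2, here is a minimal, self-contained sketch. It deliberately does not use the Lucene API: `Tok` is a hypothetical stand-in for a token carrying term text, offsets, and a position increment, and `highlight` and `hits` are made-up names for this example, not part of the attached Highlighter.java.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OverlapSkipSketch {

    /** Hypothetical stand-in for an analyzed token (not the Lucene Token class). */
    static final class Tok {
        final String term;   // term text produced by the analyzer
        final int start;     // start offset in the original text
        final int end;       // end offset in the original text
        final int posInc;    // position increment; 0 means "stacked" on the previous token

        Tok(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    /**
     * Rebuilds the text, wrapping tokens whose term is in hits with B tags.
     * Tokens with a position increment of 0 are skipped so that the same
     * stretch of original text is not written out twice.
     */
    static String highlight(String text, List<Tok> tokens, Set<String> hits) {
        StringBuilder out = new StringBuilder();
        int last = 0; // end offset of the last token written
        for (Tok t : tokens) {
            if (t.posInc == 0) {
                continue; // overlapping token (e.g. an injected synonym): do not print
            }
            out.append(text, last, t.start); // text between tokens
            String original = text.substring(t.start, t.end);
            if (hits.contains(t.term)) {
                out.append("<B>").append(original).append("</B>");
            } else {
                out.append(original);
            }
            last = t.end;
        }
        out.append(text.substring(last)); // trailing text after the final token
        return out.toString();
    }

    public static void main(String[] args) {
        // "soccer" is a synonym stacked at the same position as "football".
        List<Tok> tokens = Arrays.asList(
                new Tok("football", 0, 8, 1),
                new Tok("soccer", 0, 8, 0),
                new Tok("is", 9, 11, 1),
                new Tok("fun", 12, 15, 1));
        Set<String> hits = new HashSet<>(Arrays.asList("football"));
        System.out.println(highlight("football is fun", tokens, hits));
    }
}
```

One limitation of this sketch: if the query matches only the stacked token ("soccer" above), skipping it also drops the highlight, which may be one reason the simple fix is not yet fully trustworthy.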

> 3) Without wishing to resurrect the whole 1.4 vs 1.5 debate I believe
> Lucene still targets Java 1.4.

Just me being lazy. I swear I have seen contrib stuff that says 1.5. I have gone through and stripped out all of the 1.5 features except for StringBuilder for the moment.

> To rectify these points it's not clear to me if it would be quicker to use
> your code or adapt the existing highlighter code to use spans.
> Thoughts?

Depends entirely on what you think. I am sure I can fix all of the issues you mention (with a little advice <G>), but I am pretty new to this type of thing, and perhaps you just want to start from scratch in order to achieve span highlighting with the existing highlighter. It may just be that the way I am doing this is not very compatible with the way you currently fragment and score.

I have added an updated Highlighter.java and HighlighterTest.java. The MemoryIndex problem remains, so either it has to be fixed or the modified MemoryIndex must be used.

- Mark m

> Beginnings of a span based highlighter
> --------------------------------------
>
>                 Key: LUCENE-794
>                 URL: https://issues.apache.org/jira/browse/LUCENE-794
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: DefaultEncoder.java, Encoder.java, Formatter.java,
>                      Highlighter.java, Highlighter.java, HighlighterTest.java,
>                      HighlighterTest.java, MemoryIndex.java,
>                      QuerySpansExtractor.java, SimpleFormatter.java
>
>
> This is some test code to start the work of adding a span based highlighting
> approach to the existing highlighter in contrib. See
> http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.