[ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804338#comment-15804338
 ] 

Jim Ferenczi commented on LUCENE-7620:
--------------------------------------

It looks good [~dsmiley]! I had started working on something similar but got 
caught up in something else ;)
I wonder, though, whether we should also break a sentence if it's too long. 
Maybe the wrapped BreakIterator could always be a sentence one, and we could 
use a WordBreakIterator to cut sentences that are too long? This way it would 
produce snippets similar to those of the SimpleFragmenter.
It could also be done in another BreakIterator on top of this one, but that 
would overcomplicate things, I guess.
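
A rough sketch of the sentence-then-word idea, using only the JDK's java.text.BreakIterator. The class name SentenceThenWordSplitter, the method boundaries, and the maxLen parameter are made up for illustration and are not part of the attached patch. It assumes that "too long" simply means longer than maxLen characters.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the attached patch.
class SentenceThenWordSplitter {

    /**
     * Returns fragment end offsets: every sentence end, plus extra cuts at
     * word boundaries inside sentences longer than maxLen characters.
     */
    static List<Integer> boundaries(String text, int maxLen) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        sentences.setText(text);
        words.setText(text);

        List<Integer> result = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
            while (end - start > maxLen) {
                // Last word boundary at or before start + maxLen.
                int cut = words.preceding(start + maxLen + 1);
                if (cut <= start) {
                    // A single token longer than maxLen: fall back to a hard cut.
                    cut = start + maxLen;
                }
                result.add(cut);
                start = cut;
            }
            result.add(end);
            start = end;
        }
        return result;
    }
}
```

Every fragment produced this way is at most maxLen characters long, and short sentences are left intact. The hard-cut fallback works on raw char offsets, so a real implementation would probably want to avoid splitting surrogate pairs.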
For the implementation, can you throw an exception in the methods that should 
not be called? For instance, {noformat}next(n){noformat} cannot be implemented 
efficiently (you need to start from 0 if you want to know the Nth boundary), 
but currently it returns the Nth boundary of the wrapped break iterator. I 
think it's better to throw an exception; that way it is obvious that some 
methods should not be called.
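
To make the suggestion concrete, here is a minimal sketch of the pattern. ForwardOnlyBreakIterator is a hypothetical name, and this wrapper is a plain pass-through rather than the length-goal iterator from the patch. Which methods beyond next(n) should be rejected is an assumption here (last() and previous() are chosen only as examples).

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;

// Hypothetical wrapper illustrating "throw on unsupported methods".
class ForwardOnlyBreakIterator extends BreakIterator {
    private final BreakIterator delegate;

    ForwardOnlyBreakIterator(BreakIterator delegate) {
        this.delegate = delegate;
    }

    // Methods this sketch assumes a forward-scanning caller needs.
    @Override public int first() { return delegate.first(); }
    @Override public int next() { return delegate.next(); }
    @Override public int current() { return delegate.current(); }
    @Override public int following(int offset) { return delegate.following(offset); }
    @Override public int preceding(int offset) { return delegate.preceding(offset); }
    @Override public CharacterIterator getText() { return delegate.getText(); }
    @Override public void setText(CharacterIterator text) { delegate.setText(text); }
    @Override public void setText(String text) { delegate.setText(text); }

    // Fail loudly instead of silently returning the delegate's Nth boundary.
    @Override public int next(int n) {
        throw new UnsupportedOperationException("next(n) is not supported by this wrapper");
    }

    // Assumption: also treated as unsupported in this sketch.
    @Override public int last() {
        throw new UnsupportedOperationException("last() is not supported by this wrapper");
    }

    @Override public int previous() {
        throw new UnsupportedOperationException("previous() is not supported by this wrapper");
    }
}
```

A caller that accidentally relies on next(n) then fails immediately instead of getting boundaries that ignore the wrapper's own logic.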

Additionally, I think that we should have a way to change the start and end of 
a passage once we know all the matches that it contains. This is what the FVH 
does, and it should be doable in the UH because the passages are created on the 
fly in a forward manner. This is of course not the purpose of this issue, and 
it should be treated as a new feature, but I think it would be great to have 
the same output as the FVH when the max length of the passage is set.
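
As an illustration only, one way to re-fit a passage around its matches might look like the sketch below. This is not the FVH's actual algorithm; PassageWindow, fit, and all parameter names are hypothetical, and the sketch works on raw character offsets without snapping to word boundaries.

```java
// Hypothetical helper: shrink a passage to maxLen characters around its matches.
class PassageWindow {

    /** Returns {start, end} of a window of at most maxLen chars inside the passage. */
    static int[] fit(int passageStart, int passageEnd,
                     int firstMatchStart, int lastMatchEnd, int maxLen) {
        if (passageEnd - passageStart <= maxLen) {
            return new int[] {passageStart, passageEnd}; // already short enough
        }
        int matchSpan = lastMatchEnd - firstMatchStart;
        int slack = Math.max(0, maxLen - matchSpan);     // room left for context
        // Put roughly half of the context before the first match.
        int start = Math.max(passageStart, firstMatchStart - slack / 2);
        int end = Math.min(passageEnd, start + maxLen);
        // If the right side was clipped by the passage end, extend to the left.
        start = Math.max(passageStart, end - maxLen);
        return new int[] {start, end};
    }
}
```

For example, a passage spanning offsets 0 to 200 with matches between 150 and 160 and maxLen 100 is trimmed to the window 100 to 200. If the matches themselves span more than maxLen, the window starts at the first match and later matches are cut off.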




> UnifiedHighlighter: add target character width BreakIterator wrapper
> --------------------------------------------------------------------
>
>                 Key: LUCENE-7620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7620
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates 
> fragments (aka Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  
> It's useful in its own right and of course it helps users transition to the 
> UH.  I'd like to do it as a wrapper to another BreakIterator -- perhaps a 
> sentence one.  In this way you get back Passages that are a number of 
> sentences so they will look nice instead of breaking mid-way through a 
> sentence.  And you get some control by specifying a target number of 
> characters.  This BreakIterator wouldn't be a general purpose 
> java.text.BreakIterator since it would assume it's called in a manner exactly 
> as the UnifiedHighlighter uses it.  It would probably be compatible with the 
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your 
> BreakIterator config.



