[
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804800#comment-15804800
]
David Smiley commented on LUCENE-7620:
--------------------------------------
bq. Though I wonder if we should also break the sentence if it's too long?
Maybe the wrapped breakiterator could always be a sentence one and we could use
a WordBreakIterator to cut sentences that are too long? This way it would
produce snippets that are similar to the SimpleFragmenter.
It could also be done in another breakiterator on top of this one but this
would make things overcomplicated, I guess.
By choosing a lengthGoal on the low side, maybe "too long" will tend not to be
a problem? Or see my TODO at the top of the file -- essentially, choose the
break that is closest to the goal instead of always the first one following it.
Maybe I'll add that in my next patch.
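
To make the "closest to the goal" idea concrete, here is a minimal sketch. It
is not code from the attached patch: the class name {{ClosestBreakSketch}}, the
helper {{closestBreak}}, and its parameters are hypothetical, and the only real
API it relies on is {{java.text.BreakIterator}}. It assumes a passage must not
be empty, so it never returns a break at or before the passage start, and it
resolves ties in favor of the shorter passage.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration only; not the LengthGoalBreakIterator from the patch.
class ClosestBreakSketch {

    /**
     * Returns the boundary of bi nearest to (start + lengthGoal), never a
     * boundary at or before start. Returns BreakIterator.DONE when start is
     * already at the end of the text.
     */
    static int closestBreak(BreakIterator bi, String text, int start, int lengthGoal) {
        if (start < 0 || lengthGoal < 0) {
            throw new IllegalArgumentException("start and lengthGoal must be >= 0");
        }
        if (start >= text.length()) {
            return BreakIterator.DONE;
        }
        bi.setText(text);
        int target = Math.min(start + lengthGoal, text.length());
        // The goal itself is a boundary: nothing can be closer.
        if (target > start && bi.isBoundary(target)) {
            return target;
        }
        // First boundary strictly after the goal (what "first following" would pick).
        int after = bi.following(target);
        if (target <= start) {
            return after; // lengthGoal == 0: only a later boundary makes sense
        }
        // Last boundary strictly before the goal.
        int before = bi.preceding(target);
        if (before <= start) {
            return after; // also covers BreakIterator.DONE (-1)
        }
        // Pick whichever side is nearer; ties go to the shorter passage.
        return (target - before) <= (after - target) ? before : after;
    }

    public static void main(String[] args) {
        String text = "One two three. Four five six. Seven eight nine.";
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        int end = closestBreak(sentences, text, 0, 18);
        System.out.println("[" + text.substring(0, end) + "]");
    }
}
```

With a goal of 18 characters, the nearest sentence boundary lies before the
goal, so this sketch stops after the first sentence, whereas always taking the
first break following the goal would include the second sentence as well.
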
I don't think we should try to emulate SimpleFragmenter exactly. We can do a
much better job ;-) I like this implementation as a wrapper BreakIterator;
perhaps we'll add a regex BreakIterator one day, and then it would simply fit
right in.
bq. For the implementation can you throw an exception on the method that should
not be called ? For instance ...(etc)
Yeah, I could go either way on that... how about
{{assert false : "not supported/expected";}}?
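
For comparison, the two options might look like the sketch below. The class
{{ForwardOnlyBreakIterator}} is hypothetical, and which methods it treats as
"not expected" is purely an assumption for illustration, not a statement about
what the UnifiedHighlighter actually calls. The practical difference is that
the exception is always thrown, while the {{assert}} only fires when the JVM
runs with assertions enabled ({{-ea}}), as is typical in tests.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;

// Hypothetical wrapper, only to contrast "throw" with "assert false".
class ForwardOnlyBreakIterator extends BreakIterator {
    private final BreakIterator delegate;

    ForwardOnlyBreakIterator(BreakIterator delegate) {
        this.delegate = delegate;
    }

    // Methods assumed (for this sketch) to be the supported ones: plain delegation.
    @Override public int first() { return delegate.first(); }
    @Override public int next() { return delegate.next(); }
    @Override public int following(int offset) { return delegate.following(offset); }
    @Override public int preceding(int offset) { return delegate.preceding(offset); }
    @Override public int current() { return delegate.current(); }
    @Override public CharacterIterator getText() { return delegate.getText(); }
    @Override public void setText(CharacterIterator newText) { delegate.setText(newText); }

    // Option A: always fail loudly, with or without -ea.
    @Override public int last() {
        throw new UnsupportedOperationException("not supported/expected");
    }

    @Override public int previous() {
        throw new UnsupportedOperationException("not supported/expected");
    }

    // Option B: fail only when assertions are enabled; otherwise fall through
    // to the delegate so production callers still get an answer.
    @Override public int next(int n) {
        assert false : "not supported/expected";
        return delegate.next(n);
    }
}
```
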
bq. Additionally I think that we should have a way to change the start and end
of a passage when we know all the matches that it contains. This is what the FVH
is doing and it should be doable in the UH because the passages are created on
the fly in a forward manner. This is of course not the purpose of this issue and
it should be treated as a new feature but I think it would be great to have the
same output as the FVH when the max length of the passage is set.
Definitely a separate issue. It wouldn't fit into the BreakIterator
abstraction either. Maybe some Passage post-processor-like thing. Or maybe
simply expose sufficient hooks to allow subclassers to do this. That keeps the
UH simpler.
> UnifiedHighlighter: add target character width BreakIterator wrapper
> --------------------------------------------------------------------
>
> Key: LUCENE-7620
> URL: https://issues.apache.org/jira/browse/LUCENE-7620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates
> fragments (aka Passages) by a character width. The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.
> It's useful in its own right and of course it helps users transition to the
> UH. I'd like to do it as a wrapper to another BreakIterator -- perhaps a
> sentence one. In this way you get back Passages that are a number of
> sentences so they will look nice instead of breaking mid-way through a
> sentence. And you get some control by specifying a target number of
> characters. This BreakIterator wouldn't be a general purpose
> java.text.BreakIterator since it would assume it's called in a manner exactly
> as the UnifiedHighlighter uses it. It would probably be compatible with the
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your
> BreakIterator config.