[ 
https://issues.apache.org/jira/browse/LUCENE-7729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937656#comment-15937656
 ] 

David Smiley commented on LUCENE-7729:
--------------------------------------

BTW, in multiple places (code comments in your patch, plus the commentary on 
this issue) I've seen {{len > 0}}, but in all cases you probably mean 
{{len > 1}}?

RE resetting a failed match: Good point that your patch addresses the specific 
example I gave, and apparently any separator of length 2. Let me give a better 
example of length 3: a separator of {{aab}} would fail to be detected in 
{{aaab}}. I just wrote a test for that to confirm it failed. Here's another 
example of length 4 that may be clearer: a separator of {{acab}} would fail to 
be detected in {{acacab}}.
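
To make the failure mode concrete, here is a minimal sketch. It is not the 
patch's actual code: SeparatorScan and both method names are hypothetical, and 
the "resetting" scan is only an assumption about the general shape of the bug 
(on a mismatch it restarts from the current char instead of backing up). It 
handles length-2 separators but misses the two examples above, while a scan 
that retries from every start offset finds them.

```java
/** Hypothetical helper, not part of Lucene: contrasts two separator scans. */
final class SeparatorScan {

  /**
   * Assumed buggy shape: on a mismatch, reset the match count and re-test only
   * the current char against the first separator char. Characters consumed by
   * the failed partial match are never reconsidered as a possible new start.
   * Returns the offset just past the first separator found, or -1.
   */
  static int resettingFollowing(String text, String sep) {
    int matched = 0;
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) == sep.charAt(matched)) {
        matched++;
      } else {
        // Reset, then give the current char one chance to start a new match.
        matched = (text.charAt(i) == sep.charAt(0)) ? 1 : 0;
      }
      if (matched == sep.length()) {
        return i + 1;
      }
    }
    return -1;
  }

  /**
   * Straightforward alternative: try the separator at every start offset, so an
   * overlapping partial match cannot hide a real one.
   * Returns the offset just past the first separator found, or -1.
   */
  static int backtrackingFollowing(String text, String sep) {
    for (int start = 0; start + sep.length() <= text.length(); start++) {
      if (text.startsWith(sep, start)) {
        return start + sep.length();
      }
    }
    return -1;
  }
}
```

With these definitions, resettingFollowing("aab", "ab") returns 3, but 
resettingFollowing("aaab", "aab") and resettingFollowing("acacab", "acab") 
both return -1, whereas backtrackingFollowing returns 4 and 6 respectively.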

testBreakOnCustomSeparator: you commented out a couple of assertions because 
they don't apply when the separator is longer than 1 char. Instead, you could 
add a condition so they only run when the length is 1.

RE my proposed single-char constructor: this is just syntactic sugar (i.e. a 
convenience). A bunch of changes in your diff would then be able to stay the 
same.
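
As a rough illustration of what "syntactic sugar" means here, the sketch below 
uses a hypothetical StringSeparator class (not the real 
CustomSeparatorBreakIterator, whose API is not reproduced here). The char 
constructor simply delegates to the String one, so existing callers that pass 
a char would not need to change.

```java
/** Hypothetical holder class, used only to illustrate constructor delegation. */
final class StringSeparator {
  private final String separator;

  /** General form: a separator of arbitrary (non-zero) length. */
  StringSeparator(String separator) {
    if (separator == null || separator.isEmpty()) {
      throw new IllegalArgumentException("separator must be non-empty");
    }
    this.separator = separator;
  }

  /** Convenience form: a single char is wrapped into a one-char String. */
  StringSeparator(char separator) {
    this(String.valueOf(separator));
  }

  String separator() {
    return separator;
  }
}
```

Code written as new StringSeparator('\n') and new StringSeparator("\n") would 
then behave identically.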

bq. I observed the code and understood it will not require major refactoring to 
change the current implementation for arbitrary length string.

Yeah, I figured. I envy the time you have on your hands to implement a feature 
that nobody has (yet) asked for :-)  To be clear, I never asked for or 
recommended it. I sometimes work on something fun to me too; scratch some itch.

Speaking of scratching itches... check out SimplePatternTokenizer (recently 
added to Lucene) and how it works with an Automaton. Now I'm sure *that* would 
be useful to users; the original Highlighter (via Solr at least) had a regexp 
passage splitter. One possible direction you might take is to leave 
CustomSeparatorBreakIterator as it is and instead write a new BreakIterator 
that takes a regexp/automaton... and then if some user wants to split on a 
string, they could use that one.
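
A very rough sketch of that direction follows. It uses java.util.regex as a 
stand-in for a Lucene Automaton and computes passage boundaries eagerly rather 
than implementing java.text.BreakIterator; the PatternBreaks class and its 
boundaries method are hypothetical names, not existing Lucene API. A plain 
string separator is just the special case of a quoted pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: passage boundaries from a regexp separator. */
final class PatternBreaks {

  /**
   * Returns the offset just past each non-empty separator match, followed by
   * the end of the text if the last match does not already end there.
   */
  static List<Integer> boundaries(String text, Pattern separator) {
    List<Integer> result = new ArrayList<>();
    Matcher m = separator.matcher(text);
    while (m.find()) {
      if (m.end() > m.start()) { // ignore zero-length matches
        result.add(m.end());
      }
    }
    if (result.isEmpty() || result.get(result.size() - 1) != text.length()) {
      result.add(text.length());
    }
    return result;
  }
}
```

For example, splitting on the literal string from the earlier example would 
look like PatternBreaks.boundaries(text, Pattern.compile(Pattern.quote("acab"))).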

> Support for string type separator for CustomSeparatorBreakIterator
> ------------------------------------------------------------------
>
>                 Key: LUCENE-7729
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7729
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Amrit Sarkar
>         Attachments: LUCENE-7729.patch, LUCENE-7729.patch
>
>
> LUCENE-6485: currently CustomSeparatorBreakIterator breaks the text when the 
> _char_ passed is found.
> Improved CustomSeparatorBreakIterator so that it now supports a string 
> separator of arbitrary length.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
