[
https://issues.apache.org/jira/browse/LUCENE-7729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amrit Sarkar updated LUCENE-7729:
---------------------------------
Attachment: LUCENE-7729.patch
Thank you David for looking into this. Updated: LUCENE-7729.patch
bq. One issue with the implementation I see is that if it starts to find a
match but ultimately doesn't, then the position is not reset back to the start
(plus 1). This means hypothetically a string separator of ab would fail to find
the substring in the input aab. I didn't try with your patch but do you concur?
This is taken care of by the following sections of the original patch:
CustomSeparatorBreakIterator::advanceForward()::72
CustomSeparatorBreakIterator::advanceBackward()::121
{code}
if (sep_index != separator.length() - 1) { // separator len > 1
  sep_index = separator.length() - 1;
  if (c == separator.charAt(sep_index)) { // check the current token match with last element
    sep_index--;
  }
}
{code}
{code}
if (sep_index != 0) { // separator len > 0
  sep_index = 0;
  if (c == separator.charAt(sep_index)) { // check the current token match with first element
    sep_index++;
  }
}
{code}
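To make the aab / ab case concrete, here is a minimal standalone sketch of a forward scan that uses the same reset step. The class and method names (SeparatorScanSketch, scanForward) are hypothetical and are not taken from the patch; the loop around the reset is my assumption about how the snippet is used.

```java
// Hypothetical illustration only; not the code from LUCENE-7729.patch.
class SeparatorScanSketch {

    /**
     * Scans text forward from start and returns the index just past the first
     * occurrence of separator, or -1 if none is found.
     * Assumes separator is non-empty.
     */
    static int scanForward(String text, String separator, int start) {
        int sepIndex = 0; // number of separator chars matched so far
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == separator.charAt(sepIndex)) {
                sepIndex++; // current char extends the partial match
            } else if (sepIndex != 0) {
                // Partial match failed: reset, then re-check the current
                // char against the first separator char (the step quoted above).
                sepIndex = 0;
                if (c == separator.charAt(sepIndex)) {
                    sepIndex++;
                }
            }
            if (sepIndex == separator.length()) {
                return i + 1; // boundary sits right after the separator
            }
        }
        return -1;
    }
}
```

With this sketch, scanForward("aab", "ab", 0) finds the separator, because the second 'a' restarts the match after the reset. Note that the sketch only re-checks the current character, so a separator whose prefix repeats (for example "aab" against "aaab") would still be missed by it; a general solution would need real backtracking.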
I have added relevant test cases to demonstrate this:
TestCustomSeparatorBreakIterator::testFollowingPrecedingBreakOnCustomSeparator::100
{code}separator = "xz";{code}
bq. I'm a little concerned about possible overhead for this mode. Maybe
subclassing to override advanceForward and advanceBackward would make sense. If
this were an inner class to do the string, then a factory method instead of
constructor could be used. I think CustomSeparatorBreakIterator should continue
to accept a single char constructor arg; I imagine most uses of this would
remain to be one character.
I have not been able to find any overhead with this implementation. A string of
length > 0 is accepted, which is arguably better than just a single char, no? I
understand that most use cases will not need more than a single char, which is
why we have whitespace as a special case, but taking a string argument by
default does no harm, since matching is done char by char internally.
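For what it is worth, both constructor forms can coexist cheaply. The following is only a sketch with a hypothetical class name (SeparatorHolderSketch), not the patch itself: the char constructor simply delegates to the string one, so single-char callers keep working unchanged.

```java
// Hypothetical illustration only; not the actual CustomSeparatorBreakIterator.
final class SeparatorHolderSketch {
    private final String separator;

    // Existing single-char usage keeps working by delegating to the string form.
    SeparatorHolderSketch(char separator) {
        this(String.valueOf(separator));
    }

    // String form: any separator of length > 0 is accepted.
    SeparatorHolderSketch(String separator) {
        if (separator == null || separator.isEmpty()) {
            throw new IllegalArgumentException("separator must have length > 0");
        }
        this.separator = separator;
    }

    String separator() {
        return separator;
    }
}
```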
Thank you for the valuable coding-standard tips too. Ishan corrected me on the
points above in another JIRA issue, and it slipped my mind that I had already
attached a patch to this one with improper indentation and style. I will take
care of this in future.
> Support for string type separator for CustomSeparatorBreakIterator
> ------------------------------------------------------------------
>
> Key: LUCENE-7729
> URL: https://issues.apache.org/jira/browse/LUCENE-7729
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Amrit Sarkar
> Attachments: LUCENE-7729.patch, LUCENE-7729.patch
>
>
> LUCENE-6485: currently CustomSeparatorBreakIterator breaks the text when the
> _char_ passed is found.
> Improved CustomSeparatorBreakIterator so that it now supports a string
> separator of arbitrary length.