[jira] [Updated] (LUCENE-7729) Support for string type separator for CustomSeparatorBreakIterator

Amrit Sarkar (JIRA) Wed, 22 Mar 2017 13:20:07 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-7729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Amrit Sarkar updated LUCENE-7729:
---------------------------------
    Attachment: LUCENE-7729.patch

Thank you, David, for looking into this. I have attached an updated 
LUCENE-7729.patch.

bq. One issue with the implementation I see is that if it starts to find a 
match but ultimately doesn't, then the position is not reset back to the start 
(plus 1). This means hypothetically a string separator of ab would fail to find 
the substring in the input aab. I didn't try with your patch but do you concur? 

This is already taken care of in the following sections of the original patch:
CustomSeparatorBreakIterator::advanceForward()::72
CustomSeparatorBreakIterator::advanceBackward()::121

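{code}
// Backward scan: a partial match was in progress (only possible when the
// separator is longer than one char), so restart at the separator's last char.
if (sep_index != separator.length() - 1) {
  sep_index = separator.length() - 1;
  if (c == separator.charAt(sep_index)) { // current char matches the last element
    sep_index--;
  }
}
{code}
{code}
// Forward scan: a partial match was in progress (only possible when the
// separator is longer than one char), so restart at the separator's first char.
if (sep_index != 0) {
  sep_index = 0;
  if (c == separator.charAt(sep_index)) { // current char matches the first element
    sep_index++;
  }
}
{code}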
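
To make the forward case concrete, here is a minimal, self-contained sketch of 
the reset-and-recheck idea applied to the aab / ab example from the question. 
It is not the code from the patch: the class ForwardScanSketch and the method 
following are hypothetical names, and returning the index just past the 
separator (or -1) is an assumption made for the illustration.

```java
final class ForwardScanSketch {
    // Hypothetical helper, not from the patch: scans text from 'start' and
    // returns the index just past the first separator found, or -1 if none.
    static int following(String text, String separator, int start) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        int sepIndex = 0; // how many separator chars have matched so far
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == separator.charAt(sepIndex)) {
                sepIndex++;
                if (sepIndex == separator.length()) {
                    return i + 1; // whole separator matched
                }
            } else if (sepIndex != 0) {
                // A partial match failed: restart, and re-check the current
                // char against the first separator char before moving on.
                sepIndex = 0;
                if (c == separator.charAt(sepIndex)) {
                    sepIndex++;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(following("aab", "ab", 0)); // prints 3
    }
}
```

The second a in aab fails against b, the index is reset, and that same a is 
re-checked against the first separator char, so ab is still found. Note that 
this single-step fallback in the sketch is not a general substring search: a 
separator with a repeated prefix (for example aab against aaab) would need a 
fuller fallback, which the sketch does not attempt.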

I have added test cases that demonstrate this:

TestCustomSeparatorBreakIterator::testFollowingPrecedingBreakOnCustomSeparator::100
{code}separator = "xz";{code}
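
As an illustration of the backward case with the same "xz" separator, the 
following is a rough sketch only. The input strings are made up for the 
example and are not taken from the test; BackwardScanSketch and preceding are 
hypothetical names, and returning the index just past the matched separator 
(or -1) is an assumption.

```java
final class BackwardScanSketch {
    // Hypothetical helper, not from the patch: scans text backward from 'end'
    // (exclusive) and returns the index just past the nearest separator that
    // lies entirely before 'end', or -1 if none is found.
    static int preceding(String text, String separator, int end) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        int last = separator.length() - 1;
        int sepIndex = last; // next separator char to match, right to left
        for (int i = Math.min(end, text.length()) - 1; i >= 0; i--) {
            char c = text.charAt(i);
            if (c == separator.charAt(sepIndex)) {
                sepIndex--;
                if (sepIndex < 0) {
                    return i + separator.length(); // whole separator matched
                }
            } else if (sepIndex != last) {
                // A partial match failed: restart, and re-check the current
                // char against the last separator char before moving on.
                sepIndex = last;
                if (c == separator.charAt(sepIndex)) {
                    sepIndex--;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "xzz": the trailing z starts a match that fails on the middle z,
        // which is then re-checked and completes "xz" at positions 0..1.
        System.out.println(preceding("xzz", "xz", 3)); // prints 2
    }
}
```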

bq. I'm a little concerned about possible overhead for this mode. Maybe 
subclassing to override advanceForward and advanceBackward would make sense. If 
this were an inner class to do the string, then a factory method instead of 
constructor could be used. I think CustomSeparatorBreakIterator should continue 
to accept a single char constructor arg; I imagine most uses of this would 
remain to be one character.

I was not able to find any overhead with this implementation. Accepting a 
string of length > 0 is more general than accepting only a single char, no? I 
understand that most use cases will not need more than a single char, which is 
why whitespace is handled specially, but taking a string argument by default 
does no harm, since the matching is done char by char internally.
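
For what it is worth, both entry points can coexist cheaply. The sketch below 
is only an illustration of that design choice, not the actual 
CustomSeparatorBreakIterator API: SeparatorHolder is a hypothetical class, and 
the delegation through String.valueOf is an assumption about how a single-char 
constructor could be kept alongside a string one.

```java
final class SeparatorHolder {
    private final String separator;

    // Kept for callers that only need a single-char separator;
    // delegates to the string form.
    SeparatorHolder(char separator) {
        this(String.valueOf(separator));
    }

    // General form: any non-empty string is accepted.
    SeparatorHolder(String separator) {
        if (separator == null || separator.isEmpty()) {
            throw new IllegalArgumentException("separator must have length > 0");
        }
        this.separator = separator;
    }

    String separator() {
        return separator;
    }

    public static void main(String[] args) {
        System.out.println(new SeparatorHolder('x').separator());  // prints x
        System.out.println(new SeparatorHolder("xz").separator()); // prints xz
    }
}
```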

Thank you for the valuable coding-standard tips too. Ishan corrected me on the 
points above in another JIRA, and it slipped my mind that I had already 
attached a patch to this one with improper indentation and style. I will take 
care of this in future.

> Support for string type separator for CustomSeparatorBreakIterator
> ------------------------------------------------------------------
>
>                 Key: LUCENE-7729
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7729
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Amrit Sarkar
>         Attachments: LUCENE-7729.patch, LUCENE-7729.patch
>
>
> LUCENE-6485: currently CustomSeparatorBreakIterator breaks the text when the 
> _char_ passed is found.
> This improves CustomSeparatorBreakIterator so that it supports a string 
> separator of arbitrary length.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
