Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/191
  
    > In your next steps section, is there anything required of me?
    
    Just to get this updated with master if you get the chance before I do, but 
it looks like you beat me to it.
    
    I am working on fixing the BreakIterator-related Highlighter tests now. 
Still have 7 that are failing. I am trying to identify all of the issues, so 
this list may not be complete, but this is what I have found so far.
    
    icu-dotnet Issues
    ----
    1. Sentence breaking does not work when the first word of a sentence is 
lowercase. In Java, both of the following inputs return the same result (the 
latter):
    
    ```
    ?Icu.BreakIterator.GetBoundaries(Icu.BreakIterator.UBreakIteratorType.SENTENCE, new Icu.Locale("en-US"), "test this is.  another sentence this test has.  far away is that planet.")
    Count = 1
        [0]: {Start: [0], End: [72]}
    
    ?Icu.BreakIterator.GetBoundaries(Icu.BreakIterator.UBreakIteratorType.SENTENCE, new Icu.Locale("en-US"), "Test this is.  Another sentence this test has.  Far away is that planet.")
    Count = 3
        [0]: {Start: [0], End: [15]}
        [1]: {Start: [15], End: [48]}
        [2]: {Start: [48], End: [72]}
    ```
    2. Sentence incorrectly breaking when there is a `\n` in the string. In 
this string: `"any application that requires\nfull-text search, especially 
cross-platform. \nApache Lucene is an open source project available for free 
download."` we are expecting the first sentence to end at 76, but getting 30.
    3. Word breaking is happening on hyphenated words instead of treating them 
as a single word, for example, "high-performance" should be considered a single 
word, not 2 words.
    
    4. The ThaiWordBreaker class was added to work around another BreakIterator 
difference from Java: in Java, Thai characters are broken into separate 
"words" when they are adjacent to non-Thai characters. For example, 
"สวัสดีkrap" should break into "สวัสดี" and "krap". 
Ideally, icu-dotnet would handle this, but the ThaiWordBreaker solution is 
acceptable if that is unreasonable to do.
    5. If we are keeping the ThaiWordBreaker: I just ran the tests dealing with 
Thai numerals, and my assumption that they should be broken just like Thai 
characters was incorrect. So [these 
lines](https://github.com/conniey/lucenenet/blob/08453c16290465842866affa6f2fdd35517608b6/src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs#L235-L236)
 should be changed to:
    ```
    isThai = char.IsLetter(c) && thaiPattern.IsMatch(c.ToString());
    isNonThai = char.IsLetter(c) && !isThai;
    ```
    You may wish to also change the variable names to `isThaiLetter`, 
`isNonThaiLetter`, etc. to make this more clear in the code. You can use the 
following test to verify the results.
    
    ```
    [Test, LuceneNetSpecific]
    public void TestNumeralBreaking()
    {
        ThaiAnalyzer analyzer = new ThaiAnalyzer(TEST_VERSION_CURRENT, CharArraySet.EMPTY_SET);
        // Thai numerals followed by ASCII digits should come back as a single token.
        AssertAnalyzesTo(analyzer, "๑๒๓456", new string[] { "๑๒๓456" });
    }
    ```
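    
    To illustrate why the `char.IsLetter` guard matters, here is a minimal, 
standalone sketch of the check from item 5. The class name, the method names, 
and the `\p{IsThai}` pattern are assumptions for illustration only; the actual 
`thaiPattern` field in ThaiTokenizer.cs may be defined differently.
    
    ```csharp
    using System;
    using System.Text.RegularExpressions;
    
    // Hypothetical helper; not part of Lucene.Net.
    public static class ThaiLetterClassifier
    {
        // Assumption: stands in for the tokenizer's thaiPattern field.
        // \p{IsThai} is the .NET named block for U+0E00 to U+0E7F.
        private static readonly Regex ThaiPattern = new Regex(@"\p{IsThai}");
    
        // True only for letters inside the Thai block. Thai digits are in the
        // block but are decimal digits, not letters, so they are excluded.
        public static bool IsThaiLetter(char c)
        {
            return char.IsLetter(c) && ThaiPattern.IsMatch(c.ToString());
        }
    
        // True only for letters outside the Thai block.
        public static bool IsNonThaiLetter(char c)
        {
            return char.IsLetter(c) && !IsThaiLetter(c);
        }
    }
    ```
    
    With this check, Thai numerals and ASCII digits are neither "Thai letters" 
nor "non-Thai letters", so a run such as "๑๒๓456" never sees a Thai to non-Thai 
letter transition and is not split by this logic.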
    
    BreakIterator Dependencies
    ---
    
    Also, it seems like an easier path to set up the BreakIterator the way it 
was in Java (as an abstract class), since it is passed as a method and 
constructor parameter and is meant to be an extension point where you can 
design your own word breaking if you need to customize the default ICU 
behavior. So, I ported the abstract `BreakIterator` class and created a 
concrete `IcuBreakIterator` that wraps the icu-dotnet `BreakIterator` static 
functions (with the ability to pass locale and "type" to the constructor). 
This class contains the same "iterator" logic that exists in Java to move 
forward, backward, or arbitrarily through the break points. I am still working 
on creating tests to verify the behavior against Java.
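    
    As a rough illustration of the shape described above (not the actual 
ported code), the sketch below pairs a trimmed-down abstract class with a 
concrete iterator that walks a precomputed list of boundary offsets. All type 
and member names are hypothetical. A real `IcuBreakIterator` would presumably 
obtain the offsets from icu-dotnet's `GetBoundaries` instead of taking them as 
a constructor argument.
    
    ```csharp
    using System;
    using System.Collections.Generic;
    
    // Hypothetical, trimmed-down abstract class; the real port has more members.
    public abstract class BreakIteratorSketch
    {
        public const int Done = -1; // same value as java.text.BreakIterator.DONE
    
        public abstract int Current { get; }
        public abstract int First();
        public abstract int Last();
        public abstract int Next();
        public abstract int Previous();
        public abstract int Following(int offset);
    }
    
    // Concrete iterator over a fixed set of boundary offsets, e.g. 0, 15, 48, 72.
    public sealed class BoundaryListBreakIterator : BreakIteratorSketch
    {
        private readonly int[] offsets;
        private int index;
    
        public BoundaryListBreakIterator(IEnumerable<int> boundaryOffsets)
        {
            // SortedSet both sorts the offsets and drops duplicates.
            var sorted = new SortedSet<int>(boundaryOffsets);
            if (sorted.Count == 0)
                throw new ArgumentException("At least one boundary is required.");
            offsets = new int[sorted.Count];
            sorted.CopyTo(offsets);
        }
    
        public override int Current { get { return offsets[index]; } }
    
        public override int First() { index = 0; return offsets[index]; }
    
        public override int Last() { index = offsets.Length - 1; return offsets[index]; }
    
        // Moves forward one boundary, or returns Done at the last boundary.
        public override int Next()
        {
            return index >= offsets.Length - 1 ? Done : offsets[++index];
        }
    
        // Moves back one boundary, or returns Done at the first boundary.
        public override int Previous()
        {
            return index <= 0 ? Done : offsets[--index];
        }
    
        // Jumps to the first boundary strictly after the given offset.
        // In this sketch the position is left unchanged when none exists.
        public override int Following(int offset)
        {
            for (int i = 0; i < offsets.Length; i++)
            {
                if (offsets[i] > offset) { index = i; return offsets[i]; }
            }
            return Done;
        }
    }
    ```
    
    Because callers only see the abstract type, a custom word breaker can be 
swapped in without touching the code that consumes the iterator.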
    
    However, since this class depends directly on icu-dotnet, we can't just put 
it into our Support namespace without adding an icu-dotnet dependency to 
Lucene.Net.Core. And since other parts of Lucene (SmartCN, ICU, 
Analysis.Common, etc.) depend on BreakIterator functionality, it would be 
simpler to share this behavior if it were part of a common library. 
    
    While we could build our own, it would be an extra dependency that doesn't 
exist in Lucene. Ideally, it should go in icu-dotnet (since in Java it was part 
of the JDK, which icu-dotnet is emulating). If this functionality were in 
icu-dotnet, it would not just benefit the Lucene.Net project, but could 
potentially make other projects easier to port from Java. WDYT?


