GitHub user NightOwl888 commented on the issue:

https://github.com/apache/lucenenet/pull/191

> In your next steps section, is there anything required of me?

Just to get this updated with master if you get the chance before I do, but it looks like you beat me to it.

I am working on fixing the BreakIterator-related Highlighter tests now. 7 are still failing. I am trying to identify all of the issues, so this list may not be complete, but this is what I have found so far.

icu-dotnet Issues
----

1. Sentence breaking does not work when the first word of a sentence is lower case. In Java, both of the following return the same results (the latter):

```
?Icu.BreakIterator.GetBoundaries(Icu.BreakIterator.UBreakIteratorType.SENTENCE, new Icu.Locale("en-US"), "test this is.  another sentence this test has.  far away is that planet.")
Count = 1
    [0]: {Start: [0], End: [72]}
?Icu.BreakIterator.GetBoundaries(Icu.BreakIterator.UBreakIteratorType.SENTENCE, new Icu.Locale("en-US"), "Test this is.  Another sentence this test has.  Far away is that planet.")
Count = 3
    [0]: {Start: [0], End: [15]}
    [1]: {Start: [15], End: [48]}
    [2]: {Start: [48], End: [72]}
```

2. Sentences break incorrectly when there is a `\n` in the string. In this string: `"any application that requires\nfull-text search, especially cross-platform. \nApache Lucene is an open source project available for free download."` we are expecting the first sentence to end at 76, but getting 30.
3. Word breaking happens on hyphenated words instead of treating them as a single word. For example, "high-performance" should be considered a single word, not 2 words.
4. The ThaiWordBreaker class was added to work around another BreakIterator difference from Java, namely that in Java, Thai characters are broken into separate "words" if adjacent to non-Thai characters. For example, "สวัสดีkrap" should break into "สวัสดี" and "krap".
Ideally, icu-dotnet would handle this, but this solution is acceptable if that is unreasonable to do.

5. If we are keeping the ThaiWordBreaker: I just ran the tests dealing with Thai numerals, and my assumption that those should be broken just like Thai characters was incorrect. So, [these lines](https://github.com/conniey/lucenenet/blob/08453c16290465842866affa6f2fdd35517608b6/src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs#L235-L236) should be changed to:

```
isThai = char.IsLetter(c) && thaiPattern.IsMatch(c.ToString());
isNonThai = char.IsLetter(c) && !isThai;
```

You may also wish to change the variable names to `isThaiLetter`, `isNonThaiLetter`, etc. to make this clearer in the code. You can use the following test to verify the results.

```
[Test, LuceneNetSpecific]
public void TestNumeralBreaking()
{
    ThaiAnalyzer analyzer = new ThaiAnalyzer(TEST_VERSION_CURRENT, CharArraySet.EMPTY_SET);
    AssertAnalyzesTo(analyzer, "à¹à¹à¹456", new String[] { "à¹à¹à¹456" });
}
```

BreakIterator Dependencies
---

Also, it seems like an easier path to set up the BreakIterator the way it was in Java (as an abstract class), since it is passed as a method and constructor parameter and because it is meant to be an extension point where you can design your own word breaking if you need to customize the default ICU behavior.

So, I ported the abstract `BreakIterator` class and have created a concrete `IcuBreakIterator` to wrap the icu-dotnet "BreakIterator" static functions (with the ability to pass locale and "type" to the constructor). I am still working on creating tests to verify the behavior against Java. Basically, the same "iterator" logic that exists in Java to move forward, backward, or arbitrarily through the break points is in this class.

However, since this class depends directly on icu-dotnet, we can't just put it into our Support namespace without adding an icu-dotnet dependency to Lucene.Net.Core.
And since other parts of Lucene (SimpleCN, ICU, Analysis.Common, etc.) depend on BreakIterator functionality, it would be simpler to share this behavior if it were part of a common library. While we could build our own, it would be an extra dependency that doesn't exist in Lucene.

Ideally, it should go in icu-dotnet (since in Java it was part of the JDK, which icu-dotnet is emulating). If this functionality were in icu-dotnet, it would not just benefit the Lucene.Net project, but could potentially make other projects easier to port from Java.

WDYT?
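
To make the abstract-class-plus-wrapper arrangement above concrete, here is a minimal sketch. It is not the ported code: the member names are modeled loosely on `java.text.BreakIterator`, `PrecomputedBreakIterator` is a hypothetical name, and the `findBoundaries` delegate is an assumed stand-in for the icu-dotnet static call so that the example has no icu-dotnet dependency.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical extension point, shaped after java.text.BreakIterator.
public abstract class BreakIterator
{
    // Returned when iteration runs past either end, as in Java.
    public const int Done = -1;

    public abstract void SetText(string text);
    public abstract int Current { get; }
    public abstract int First();
    public abstract int Last();
    public abstract int Next();
    public abstract int Previous();
}

// Hypothetical wrapper: computes all boundaries up front through an injected
// function (assumed stand-in for the icu-dotnet call) and walks them with a cursor.
public sealed class PrecomputedBreakIterator : BreakIterator
{
    private readonly Func<string, IEnumerable<int>> findBoundaries;
    private List<int> offsets = new List<int> { 0 };
    private int index;

    public PrecomputedBreakIterator(Func<string, IEnumerable<int>> findBoundaries)
    {
        if (findBoundaries == null) throw new ArgumentNullException("findBoundaries");
        this.findBoundaries = findBoundaries;
    }

    public override void SetText(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        // Always include the start and end of the text; drop out-of-range
        // offsets, then de-duplicate and sort.
        offsets = findBoundaries(text)
            .Concat(new[] { 0, text.Length })
            .Where(o => o >= 0 && o <= text.Length)
            .Distinct()
            .OrderBy(o => o)
            .ToList();
        index = 0;
    }

    public override int Current { get { return offsets[index]; } }

    public override int First() { index = 0; return Current; }

    public override int Last() { index = offsets.Count - 1; return Current; }

    public override int Next()
    {
        // Stay on the last boundary and report Done, mirroring Java.
        if (index >= offsets.Count - 1) return Done;
        return offsets[++index];
    }

    public override int Previous()
    {
        // Stay on the first boundary and report Done, mirroring Java.
        if (index <= 0) return Done;
        return offsets[--index];
    }
}
```

Given the three-sentence string from issue 1 and a stub delegate that returns 15 and 48, `First()` followed by repeated `Next()` calls yields 0, 15, 48, 72, and then `Done`. Because callers only see the abstract type, a custom word breaker can be swapped in without touching them.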