Github user conniey commented on the issue:

    https://github.com/apache/lucenenet/pull/191

1. Sentence breaking not working when first word of sentence is lower case.
    * According to the [sentence boundary rules](http://www.unicode.org/reports/tr29/#Sentence_Boundary_Rules) that ICU follows, it is returning the correct sentence breaks. (This is defined in the section "Do not break after full stop in certain contexts. [See note below.]")
2. The response for 1 also applies here, where it is breaking prematurely on new-lines.
3. Word breaking is happening on hyphenated words instead of treating them as a single word, for example, "high-performance" should be considered a single word, not 2 words.
    * According to the [word break rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), we are returning the expected behaviour. The visible hyphens are breaking hyphens; if we had used a soft hyphen instead, the word would not have been broken.
4. "The ThaiWordBreaker class was added to work-around another BreakIterator difference from Java - namely that in Java Thai characters were broken into separate "words" if adjacent to non-Thai characters."
    * Unfortunately, this is due to the word breaking rules in ICU: it sees these as part of the same word because they are characters.

One way to fix the points above is to use a RuleBasedBreakIterator and modify the default rules used to create the break iterator. Would that work for Lucene.NET? I would have to add a native method to icu-dotnet that calls [ubrk_openRules](http://icu-project.org/apiref/icu4c/ubrk_8h.html#a11826cb21213916c2d91579b673d8949) so that you can create a BreakIterator from custom rules.

The default rules are here:

* [Sentence rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/sent.txt)
* [Word rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/word.txt)
* [Blog post on creating custom rules](http://sujitpal.blogspot.com/2008/05/tokenizing-text-with-icu4js.html)

5. I updated ThaiTokenizer with your code snippet and tested it against TestNumeralBreakages.

RE: BreakIterator Dependencies

* I agree that it should be an abstract class and have more functionality (i.e. moving backwards and forwards), similar to its Java counterpart. I'll see about writing a PR and submitting it to [sillsdev/icu-dotnet](https://github.com/sillsdev/icu-dotnet) to see if they will accept this feature.
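
For reference, here is a minimal sketch of the forwards and backwards traversal that the Java counterpart offers. It uses the JDK's `java.text.BreakIterator`, not icu-dotnet or ICU4C, and the class and method names (`BreakIteratorSketch`, `segmentsForward`, `segmentsBackward`) are hypothetical names chosen for illustration. The JDK ships its own break rules, so the segments it prints may differ from what ICU returns for the same text; no expected output is asserted here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.NET or icu-dotnet.
final class BreakIteratorSketch {

    // Walk the boundaries from the start of the text and collect each segment.
    static List<String> segmentsForward(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    // Walk the boundaries from the end of the text; segments come back last-first.
    static List<String> segmentsBackward(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int end = it.last();
        for (int start = it.previous(); start != BreakIterator.DONE; end = start, start = it.previous()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "A high-performance engine. it starts in lower case.";
        // Print word and sentence segments so the hyphen and lower-case cases can be inspected.
        System.out.println(segmentsForward(BreakIterator.getWordInstance(Locale.ENGLISH), text));
        System.out.println(segmentsForward(BreakIterator.getSentenceInstance(Locale.ENGLISH), text));
        System.out.println(segmentsBackward(BreakIterator.getWordInstance(Locale.ENGLISH), text));
    }
}
```

Running `main` on the sample sentence gives a quick way to compare how a given implementation treats the hyphenated word and the lower-case sentence start from points 1 and 3.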