NightOwl888 commented on issue #732: URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1297294020
Thanks for the report.

Yeah, the Thai Tokenizer is a tough nut. .NET has nothing like a `BreakIterator`, and we can't port the one from the JDK due to licensing restrictions. Apache Harmony (a JRE with an Apache license) used the ICU4J `BreakIterator`. We tried the same thing, but it behaves differently from the JDK's. In fact, it isn't just the `ThaiTokenizer` that is affected; the highlighters are affected as well.

Unfortunately, `.brk` files don't apply to dictionary-based break iterators; for those you have to use `.dict` files. They definitely cannot be loaded using the `BreakIterator` API, but I don't recall whether there is another API that can load them or whether it would actually take a custom compile of ICU4N to pull it off. See the [ICU4J Resource Information](https://unicode-org.github.io/icu/userguide/icu4j/#icu4j-resource-information). I think there might have been a way to use class paths to load custom ones, but I am still trying to learn how these things are done "normally" in Java.

Coming up with a reasonable way to work with resource files in ICU4N is still a work in progress, since the architecture for loading them in .NET is completely different. There is an attempt [here](https://github.com/NightOwl888/ICU4N/commits/feature/resource-automation) to migrate to satellite assemblies for the localized resources. However, that still means the `.dict` files would be embedded in the `ICU4N.dll` assembly. There is also an attempt to replace system properties with `Microsoft.Extensions.Configuration`, similar to how we did it in Lucene.NET, on another branch that hasn't been pushed yet. It may be relevant to how resources get injected into ICU4N, but I don't recall. It's still a mess: I went down a rabbit hole chasing a `ThreadAbortException` being thrown from the .NET `StringBuilder` class for seemingly no reason, before giving up 8 months ago. I haven't yet tried to go back and find the commit where it started.
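For anyone who wants to collect the JDK side of the comparison mentioned above, here is a minimal Java sketch that prints the segments reported by `java.text.BreakIterator` for a Thai string. The class name `ThaiBreakDemo`, the helper `segments`, and the sample text are placeholders of mine, not anything from Lucene or ICU4J. It also assumes the runtime ships Thai locale data; without it, the JDK may fall back to non-dictionary rules and the output will differ.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ThaiBreakDemo {
    // Splits text at every word boundary the JDK BreakIterator reports.
    // The pieces always concatenate back to the original text.
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample mixing Thai letters, a space, and Thai digits.
        String sample = "การที่ได้ต้องแสดงว่างานดี ๑๒๓";
        for (String s : segments(sample, Locale.forLanguageTag("th"))) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Running the same input through the ICU4J (or ICU4N) word iterator and diffing the two lists would be one way to produce the kind of difference report asked for in this thread.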
We were able to get the Lucene tests to pass by adding a [ThaiWordBreaker](https://github.com/apache/lucenenet/blob/e72315a75009854483c979462eb2406f41311796/src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs#L212-L309) class that separates Thai letters from Thai numbers (and other text). This brings us a bit closer to JDK behavior. However, we don't have (or at least couldn't find) a thorough set of tests or documentation that would let us determine all of the differences between the two implementations and close the gap. Of course, if all of it can be done using a different `.dict` file, that would be preferable to adding classes to mimic JDK behavior.

If you could share any information you can find (more tests that show differences in behavior, info on how users load custom `.dict` files in ICU4J without recompiling, info about how to create `.dict` files, etc.), it would certainly help move us forward.
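The idea of separating Thai letters from Thai numbers and other text can be illustrated with a rough Java sketch. This is not the `ThaiWordBreaker` code linked above, just my guess at the general shape: it cuts a string wherever the character class changes. The class `ThaiRunSplitter`, the `Kind` enum, and the choice to treat every non-digit character in the Unicode Thai block as "Thai" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ThaiRunSplitter {
    // THAI here means "in the Unicode Thai block but not a Thai digit",
    // which also covers Thai marks and symbols (an assumption of this sketch).
    enum Kind { THAI, THAI_DIGIT, OTHER }

    static Kind classify(int cp) {
        if (cp >= 0x0E50 && cp <= 0x0E59) return Kind.THAI_DIGIT; // Thai digits zero through nine
        if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.THAI) return Kind.THAI;
        return Kind.OTHER;
    }

    // Splits text into maximal runs of characters that share the same Kind.
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int runStart = 0;
        Kind prev = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Kind kind = classify(cp);
            if (prev != null && kind != prev) {
                runs.add(text.substring(runStart, i)); // close the previous run
                runStart = i;
            }
            prev = kind;
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) runs.add(text.substring(runStart));
        return runs;
    }
}
```

For example, `ThaiRunSplitter.split("ไทย๑๒๓abc")` returns the three runs `ไทย`, `๑๒๓`, and `abc`. A real word breaker would apply something like this on top of dictionary-based segmentation rather than instead of it.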