NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1297294020

   Thanks for the report. Yeah, the Thai Tokenizer is a tough nut. .NET has 
nothing like a `BreakIterator`, and we can't port the one from the JDK due to 
licensing restrictions. Apache Harmony (which is a JRE that has an Apache 
license) used the ICU4J `BreakIterator`. We tried the same thing, but the 
ICU-based iterator behaves differently from the one in the JDK.
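
   For context, this is roughly how the JDK API in question is used from Java. 
It is only a sketch: the class name `JdkThaiBreaks` and the sample string are 
made up for illustration, and the boundaries reported depend on the JDK version 
and its installed locale data, so no output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class JdkThaiBreaks {
    // Splits text at the word boundaries reported by the JDK's BreakIterator
    // for the Thai locale. Whitespace and punctuation come back as segments too.
    static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; prints whatever boundaries this JDK reports.
        System.out.println(segments("ภาษาไทย 123"));
    }
}
```

   Running the same input through the JDK and through an ICU-based iterator is 
one way to collect the behavioral differences as test cases.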
   
   In fact, it isn't just the `ThaiTokenizer` that is affected. The 
highlighters are affected as well.
   
   Unfortunately, `.brk` files don't apply to dictionary-based break iterators; 
for those you have to use `.dict` files. They definitely cannot be loaded 
through the `BreakIterator` API, but I don't recall whether there is another 
API that can load them or whether it will take a custom build of ICU4N to pull 
it off. See the [ICU4J Resource 
Information](https://unicode-org.github.io/icu/userguide/icu4j/#icu4j-resource-information).
 I think there might have been a way to use class paths to load custom ones, 
but I am still trying to learn how these things are done "normally" in Java.
   
   Trying to come up with a reasonable way to work with resource files in ICU4N 
is still a work in progress, since the architecture for loading them in .NET is 
completely different. There is an attempt 
[here](https://github.com/NightOwl888/ICU4N/commits/feature/resource-automation)
 to migrate to using satellite assemblies for the localized resources. However, 
that still means the `.dict` files would be embedded in the `ICU4N.dll` 
assembly.
   
   There is also an attempt, on another branch that hasn't been pushed yet, to 
replace system properties with `Microsoft.Extensions.Configuration`, similar to 
how we did it in Lucene.NET. It may be relevant to how resources get injected 
into ICU4N, but I don't recall. That branch is still a mess: I went down a 
rabbit hole chasing a `ThreadAbortException` being thrown from the .NET 
`StringBuilder` class for seemingly no reason, before giving up 8 months ago. 
I haven't yet gone back to find the commit where it started.
   
   We were able to get the Lucene tests to pass by adding a 
[ThaiWordBreaker](https://github.com/apache/lucenenet/blob/e72315a75009854483c979462eb2406f41311796/src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs#L212-L309)
 class that separates Thai letters from Thai numbers (and other text). This 
brings us a bit closer to JDK behavior. However, we don't have a thorough set 
of tests or documentation (that we could find) to determine all of the 
differences between the two implementations to close the gap. Of course, if all 
of it can be done using a different `.dict` file, that would be preferable to 
adding classes to mimic JDK behavior.
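
   To illustrate the idea of separating Thai letters from Thai numbers and 
other text, here is a minimal Java sketch. It is not the linked 
`ThaiWordBreaker` implementation: the class `ThaiRunSplitter`, its `Kind` 
enum, and the rule of treating every non-digit code point in the Thai block 
(U+0E00 to U+0E7F) as a "letter" are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class ThaiRunSplitter {
    enum Kind { THAI_LETTER, THAI_DIGIT, OTHER }

    // Assumption for this sketch: Thai digits are U+0E50..U+0E59, and every
    // other code point in the Thai block counts as a "letter" for simplicity.
    static Kind classify(int cp) {
        if (cp >= 0x0E50 && cp <= 0x0E59) return Kind.THAI_DIGIT;
        if (cp >= 0x0E00 && cp <= 0x0E7F) return Kind.THAI_LETTER;
        return Kind.OTHER;
    }

    // Splits text into maximal runs whose code points share the same Kind.
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int runStart = 0;
        Kind prev = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Kind kind = classify(cp);
            if (prev != null && kind != prev) {
                runs.add(text.substring(runStart, i));
                runStart = i;
            }
            prev = kind;
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart));
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(split("ไทย๑๒๓abc")); // [ไทย, ๑๒๓, abc]
    }
}
```

   A pre-pass like this only decides where script or digit runs start and end; 
breaking the Thai letter runs into words still needs a dictionary-based 
iterator.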
   
   If you could share any information you can find - more tests to show 
differences in behavior, info on how custom `.dict` files are loaded in ICU4J 
by users (without recompiling), info about how to create `.dict` files, etc., 
it would certainly help move us forward.
   

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
