[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

GitBox Mon, 31 Oct 2022 08:49:41 -0700


NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1297294020

Thanks for the report. Yeah, the `ThaiTokenizer` is a tough nut. .NET has
nothing like a `BreakIterator`, and we can't port the one from the JDK due to
licensing restrictions. Apache Harmony (a JRE released under the Apache
license) used the ICU4J `BreakIterator`. We tried the same thing, but it
behaves differently from the JDK's.

In fact, it isn't just the `ThaiTokenizer` that is affected; the highlighters
are affected as well.

Unfortunately, `.brk` files don't apply to dictionary-based break iterators;
for those you have to use `.dict` files. They definitely cannot be loaded
through the `BreakIterator` API, but I don't recall whether there is another
API that can load them or whether it will actually take a custom compile of
ICU4N to pull it off. See the [ICU4J Resource
Information](https://unicode-org.github.io/icu/userguide/icu4j/#icu4j-resource-information).
I think there might have been a way to use class paths to load custom ones,
but I am still trying to learn how these things are done "normally" in Java.

Trying to come up with a reasonable way to work with resource files in ICU4N
is still a work in progress, since the architecture for loading them in .NET is
completely different. There is an attempt
[here](https://github.com/NightOwl888/ICU4N/commits/feature/resource-automation)
to migrate to using satellite assemblies for the localized resources. However,
that still means the `.dict` files would be embedded in the `ICU4N.dll`
assembly.

There is also an attempt, on another branch that hasn't been pushed yet, to
replace system properties with `Microsoft.Extensions.Configuration`, similar
to how we did it in Lucene.NET. It may be relevant to how to inject resources
into ICU4N, but I don't recall. That branch is still a mess: I went down a
rabbit hole chasing a `ThreadAbortException` being thrown from the .NET
`StringBuilder` class for seemingly no reason, before giving up 8 months ago.
I haven't yet tried to go back and find the commit where it started.

We were able to get the Lucene tests to pass by adding a
[ThaiWordBreaker](https://github.com/apache/lucenenet/blob/e72315a75009854483c979462eb2406f41311796/src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs#L212-L309)
class that separates Thai letters from Thai numbers (and other text). This
brings us a bit closer to JDK behavior. However, we don't have a thorough set
of tests or documentation (that we could find) to determine all of the
differences between the two implementations to close the gap. Of course, if all
of it can be done using a different `.dict` file, that would be preferable to
adding classes to mimic JDK behavior.
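
To illustrate the general idea of splitting Thai letters from Thai digits and other text, here is a minimal Java sketch. It is not the actual `ThaiWordBreaker` code; the class name `ThaiRunSplitter`, the `Kind` enum, and the three-way classification are assumptions made for illustration, and it only splits on character-class transitions rather than doing any dictionary-based word breaking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.NET or ICU4N.
final class ThaiRunSplitter {
    enum Kind { THAI_TEXT, THAI_DIGIT, OTHER }

    // Classify a code point. Thai digits occupy U+0E50..U+0E59; any other
    // code point whose Unicode script is Thai is treated as Thai text.
    static Kind classify(int cp) {
        if (cp >= 0x0E50 && cp <= 0x0E59) {
            return Kind.THAI_DIGIT;
        }
        if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.THAI) {
            return Kind.THAI_TEXT;
        }
        return Kind.OTHER;
    }

    // Split the input into maximal runs of code points of the same kind.
    static List<String> splitRuns(String text) {
        List<String> runs = new ArrayList<>();
        int runStart = 0;
        Kind runKind = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Kind kind = classify(cp);
            if (runKind != null && kind != runKind) {
                runs.add(text.substring(runStart, i));
                runStart = i;
            }
            runKind = kind;
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Three Thai letters, three Thai digits, then Latin text.
        String sample = "\u0E44\u0E17\u0E22\u0E51\u0E52\u0E53abc";
        for (String run : splitRuns(sample)) {
            System.out.println("[" + run + "]");
        }
    }
}
```

A real fix would still need a dictionary-based iterator to break the Thai text runs into words; this only shows the letter/digit separation step.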

If you could share any information you can find - more tests to show
differences in behavior, info on how custom `.dict` files are loaded in ICU4J
by users (without recompiling), info about how to create `.dict` files, etc.,
it would certainly help move us forward.
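
As one possible way to capture JDK reference behavior for such comparisons, the following Java sketch dumps the segments produced by `java.text.BreakIterator` for a Thai locale. The class name `JdkBreakDump` and the sample string are assumptions for illustration; the actual segmentation depends on the JDK version and on whether its Thai locale data is available, so no expected output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical harness for recording JDK word-break output.
final class JdkBreakDump {

    // Return the text between each pair of consecutive word boundaries.
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Thai letters followed directly by Thai digits (sample input only).
        String sample = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22\u0E51\u0E52\u0E53";
        for (String segment : segments(sample, Locale.forLanguageTag("th"))) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

Running the same inputs through the ICU4J or ICU4N iterator and diffing the two lists would show exactly where the implementations disagree.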

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

