NightOwl888 commented on issue #793:
URL: https://github.com/apache/lucenenet/issues/793#issuecomment-1781144387

   > So I guess what you are saying is we can't have a "stable" Lucene.NET 
release unless its dependencies are stable and currently 
[Lucene.NET.ICU](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/icu/overview.html)
 is a work in progress with a changing API surface.
   
   Not exactly. We could do a release if we go over the API surface of the core 
and other completed components to finalize it AND build a multi-release scheme 
so we have 2 different release labels, one for the stable components and one 
for the unstable components. While the API work is something we have to do 
anyway, changing the build, release policy, Git labeling scheme, etc. isn't 
exactly free.
   
   `Lucene.Net.ICU` will likely change because the `CharacterIterator` still needs to be converted to a .NETified component and moved into J2N (right now it exists in `ICU4N.Support`, which is meant to go away from the public API). `CharacterEnumerator` was created for this purpose, but it had to be commented out because I couldn't get it working with the Lucene.NET components, although it worked fine in ICU4N. That change will definitely break the public API. I don't think there is anything else that will break it, though.
   
   > I'm reading into that [ICU4N](https://github.com/NightOwl888/ICU4N), 
which Lucene.NET.ICU depends on, is also probably a work in progress. And it's 
certainly worth noting that ICU support is something the Java Lucene team got 
for free in the JDK that unfortunately isn't included in the .NET Framework 
(full or core). Hence the need to create ICU4N to provide that support. A 
nontrivial endeavor in its own right.
   
   Yes, ICU4N is still a work in progress. There are several tests that still fail, often due to gaps in the port that we haven't yet covered. There are also some concurrency bugs to track down. Since it is only a partial port, there are also lots of tests left to go through that might still be able to be ported. The intention is not to port any more of the production code (except perhaps some of the formatters and parsers, because that is where most of its funding has come from so far).
   
   The ICU4J functionality is not in the JDK; rather, ICU4N is a port of ICU4J. It is hard to integrate because the gap between Java and ICU4J is not the same as the gap between .NET and ICU4N, although it is made easier by the fact that ICU is documented pretty well.
   
   In short, ICU4N and ICU4J extend the text-processing capabilities of .NET and Java by providing rules-based versions of some of the built-in components (such as the .NET `CompareInfo` class, which corresponds to the more powerful `RuleBasedCollator` in ICU4N). These components allow you to control the behavior in custom ways that simply can't be done on the raw .NET or JDK platforms. There are also many other features that are extremely valuable, such as `UnicodeSet`, which can be used like a regex character class but is much more powerful (it can even be passed a string to match all of the characters in a specific version of Unicode).
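
   To make that concrete, here is a minimal sketch of what that looks like from user code. It assumes the `ICU4N.Text` namespace and that the member names mirror the ICU4J API that ICU4N ports, so treat it as illustrative rather than exact.

```csharp
// Minimal sketch -- assumes ICU4N's ICU4N.Text namespace; member names are
// based on the ICU4J API that ICU4N ports and may differ slightly by version.
using System;
using ICU4N.Text;

public static class Icu4NDemo
{
    public static void Main()
    {
        // UnicodeSet works like a regex character class, but with full access
        // to Unicode properties.
        var letters = new UnicodeSet("[:Letter:]");
        Console.WriteLine(letters.Contains('é')); // True
        Console.WriteLine(letters.Contains('3')); // False

        // A rule-based collator lets you tailor the sort order with custom
        // rules, which CompareInfo in .NET cannot do. Here "ch" is treated
        // as its own letter that sorts after "c".
        var collator = new RuleBasedCollator("&c < ch");
        Console.WriteLine(collator.Compare("chico", "cz")); // positive: "chico" sorts after "cz"
    }
}
```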
   
   We use the ICU4N `BreakIterator` in all cases where the JDK `BreakIterator` is required because .NET completely lacks this feature (even though .NET itself depends on ICU now, the API for this is not exposed anywhere). This has also caused some compatibility issues because of differences between how ICU4J and the JDK behave, so we had to patch the `ThaiAnalyzer` and basically write our own tests for some of the highlighters. Unfortunately, the highlighters won't behave exactly the same unless we do the research to work out what to recommend as the "JDK format" by providing custom rules that correspond to the Java behavior.
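
   For anyone who hasn't used it, a word-segmentation sketch looks roughly like this (the `UCultureInfo` overload and the `Done` constant are assumptions based on how ICU4N typically .NETifies the ICU4J names):

```csharp
// Minimal sketch -- the UCultureInfo overload and the Done constant are
// assumptions based on how ICU4N typically .NETifies the ICU4J names.
using System;
using ICU4N.Globalization;
using ICU4N.Text;

public static class WordBoundaryDemo
{
    public static void Main()
    {
        string text = "弄清楚如何分解中文單字是很困難的。";

        BreakIterator words = BreakIterator.GetWordInstance(new UCultureInfo("zh"));
        words.SetText(text);

        // Walk the word boundaries; for CJK text this uses ICU's
        // dictionary-based segmentation, which System.Globalization lacks.
        int start = words.First();
        for (int end = words.Next(); end != BreakIterator.Done; start = end, end = words.Next())
        {
            Console.WriteLine(text.Substring(start, end - start));
        }
    }
}
```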
   
   > But when I review the docs for [Lucene.Net.ICU](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/icu/overview.html) and see what's included, it feels very central to a search library and encompasses such basic functionality as finding word boundaries and line break boundaries. While this seems trivial in languages like English, it's anything but trivial in languages like Chinese 要弄清楚如何分解中文單字是很困難的。("It is very hard to figure out how to split Chinese words.") or Japanese 中国語で単語を区切る方法を理解するのは難しいです ("It is hard to understand how to separate words in Chinese.").
   >
   > Given that a great many of the developers using Lucene.NET only use it for 
English text, or other languages that use the Latin alphabet, it's easy to see 
how we can sometimes lose sight of what ICU is and why it's so important. Based 
on your post, I now better understand why Lucene.NET hasn't had a public 
release yet. Still, it seems very unfortunate that such a stable product (at 
least for indexing Latin languages) has a current version (beta) that doesn't 
indicate it's production-ready for Latin languages.
   
   Actually, there are several use cases that make it valuable even for Western European languages, for example, removing diacritics from words. In .NET, this cannot be done without [a hack](https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net) because the normalization feature is missing the case folding option that ICU has. I have seen many people post this hack in their questions about Lucene.NET even though they could just use the `ICUFoldingFilter` or `ICUNormalizer2Filter` instead.

   These filters make words with accented characters, such as resume, résumé, and resumé, all normalize to the same root term for searching.
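
   As a rough sketch (assuming the `Lucene.Net.Analysis.Icu` namespace from the `Lucene.Net.ICU` package and the `Analyzer.NewAnonymous` convenience; exact member names may differ between betas), wiring `ICUFoldingFilter` into an analyzer looks like this:

```csharp
// Minimal sketch -- assumes the Lucene.Net.Analysis.Icu namespace from the
// Lucene.Net.ICU package and the Analyzer.NewAnonymous helper; exact member
// names may differ between betas.
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public static class FoldingAnalyzerFactory
{
    public static Analyzer Create()
    {
        return Analyzer.NewAnonymous((fieldName, reader) =>
        {
            // Tokenize, then fold case and diacritics so that "resume",
            // "résumé", and "resumé" all become the same token.
            Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
            TokenStream stream = new ICUFoldingFilter(tokenizer);
            return new TokenStreamComponents(tokenizer, stream);
        });
    }
}
```

   With the same analyzer applied at both index and query time, those three spellings all match the same term.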
   
   Although the components inside the `Lucene.Net.ICU` assembly are valuable as-is, the real value is in using ICU4N to build custom analysis components, for example, something along the lines of the sketch below.
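
   This is a hypothetical filter that drives ICU4N directly. The `Transliterator.GetInstance` call and the compound transform ID follow the ICU4J API that ICU4N ports; it is a sketch of the idea, not production code.

```csharp
// Hypothetical sketch -- a custom TokenFilter that drives ICU4N directly.
// Transliterator.GetInstance and the compound transform ID follow the ICU4J
// API that ICU4N ports; field/attribute names follow Lucene.NET 4.8 conventions.
using ICU4N.Text;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

public sealed class RemoveDiacriticsFilter : TokenFilter
{
    // Compound transform: decompose, strip combining marks, recompose.
    private static readonly Transliterator s_transform =
        Transliterator.GetInstance("NFD; [:Nonspacing Mark:] Remove; NFC");

    private readonly ICharTermAttribute termAtt;

    public RemoveDiacriticsFilter(TokenStream input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken())
            return false;

        // Replace the term text with its de-accented form.
        termAtt.SetEmpty().Append(s_transform.Transliterate(termAtt.ToString()));
        return true;
    }
}
```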

