NightOwl888 commented on issue #793: URL: https://github.com/apache/lucenenet/issues/793#issuecomment-1781144387
> So I guess what you are saying is we can't have a "stable" Lucene.NET release unless its dependencies are stable and currently [Lucene.NET.ICU](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/icu/overview.html) is a work in progress with a changing API surface.

Not exactly. We could do a release if we go over the API surface of the core and the other completed components to finalize it AND build a multi-release scheme so that we have 2 different release labels, one for the stable components and one for the unstable components. While the API work is something we have to do anyway, changing the build, release policy, Git labeling scheme, etc. isn't exactly free.

`Lucene.Net.ICU` will likely change because `CharacterIterator` still needs to be converted to a .NETified component and moved into J2N (right now it lives in `ICU4N.Support`, which is meant to be removed from the public API). `CharacterEnumerator` was made for this purpose, but it had to be commented out because I couldn't get it working in the Lucene.NET components, although it worked fine in ICU4N. This change will definitely break the public API. I don't think there is anything else that will break it, though.

> I'm reading into that, [ICU4N](https://github.com/NightOwl888/ICU4N), which Lucene.NET.ICU depends on, is also probably a work in progress. And it's certainly worth noting that ICU support is something the Java Lucene team got for free in the JDK that unfortunately isn't included in the .NET Framework (full or core). Hence the need to create ICU4N to provide that support. A nontrivial endeavor in its own right.

Yes, ICU4N is still a work in progress. There are several tests that still fail, often due to gaps that we haven't yet covered. There are also some concurrency bugs to track down. Since it is only a partial port, we also have lots of tests to go through that might be able to be ported.
The intention is not to port any more of the production code (except perhaps some of the formatters and parsers, because that is where most of its funding has come from so far).

The ICU4J functionality is not in the JDK. Instead, ICU4N is a port of ICU4J. But it is hard to integrate because the gap between Java and ICU4J is not the same as the gap between .NET and ICU4N. It is made easier, though, because ICU is documented pretty well.

In short, ICU4N and ICU4J extend the text processing capabilities of .NET and Java by providing rules-based versions of some of the included components (for example, the .NET `CompareInfo` class corresponds to the more powerful `RuleBasedCollator` in ICU4N). These components allow you to control the behavior in custom ways that simply can't be done on the raw .NET or JDK platforms. There are also many other features that are super valuable, such as `UnicodeSet`, which can be used like a regex character class but is much more powerful (it can even be passed a string to match all of the characters in a specific version of Unicode).

We use the ICU4N `BreakIterator` in all cases where the JDK `BreakIterator` is required because .NET is totally lacking this feature (even though .NET depends on ICU now, the API for this is not exposed anywhere). This has also caused some compatibility issues because of differences between how ICU4J and the JDK behave, so we had to patch the `ThaiAnalyzer` and basically write our own tests for some of the highlighters. Unfortunately, the highlighters won't work exactly the same as in Java unless we do the research to work out what to recommend as the "JDK format", that is, custom rules that correspond to the Java behavior.

> But when I review the docs for [Lucene.Net.ICU](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/icu/overview.html) and see what's included, it feels very central to a search library and encompasses such basic functionality as finding word boundaries and line break boundaries.
> While this seems trivial in languages like English it's anything but trivial in languages like Chinese 要弄清楚如何分解中文單字是很困難的。or Japanese 中国語で単語を区切る方法を理解するのは難しいです.
>
> Given that a great many of the developers using Lucene.NET only use it for English text, or other languages that use the Latin alphabet, it's easy to see how we can sometimes lose sight of what ICU is and why it's so important. Based on your post, I now better understand why Lucene.NET hasn't had a public release yet. Still, it seems very unfortunate that such a stable product (at least for indexing Latin languages) has a current version (beta) that doesn't indicate it's production-ready for Latin languages.

Actually, there are several use cases that make it valuable even for Western European languages, such as removing diacritics from words. In .NET, this cannot be done without [a hack](https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net) because the normalization feature is missing the case fold option that ICU has. I have seen many people post this hack in their questions about Lucene.NET even though they could just use the `ICUFoldingFilter` or `ICUNormalizer2Filter` instead. These filters make words with accented characters, such as resume, résumé, and resumé, all normalize to the same root word for searches.

Although the components inside the `Lucene.Net.ICU` assembly are indeed valuable as is, the real value is in using ICU4N to build custom analysis components.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org