NightOwl888 commented on issue #793:
URL: https://github.com/apache/lucenenet/issues/793#issuecomment-1781144387

   > So I guess what you are saying is we can't have a "stable" Lucene.NET 
release unless its dependencies are stable and currently 
[Lucene.NET.ICU](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/icu/overview.html)
 is a work in progress with a changing API surface.
   
   Not exactly. We could do a release if we go over the API surface of the core 
and other completed components to finalize it AND build a multi-release scheme 
so we have 2 different release labels, one for the stable components and one 
for the unstable components. While the API work is something we have to do 
anyway, changing the build, release policy, Git labeling scheme, etc. isn't 
exactly free.
   
   `Lucene.Net.ICU` will likely change because the `CharacterIterator` still needs to be converted to a .NETified component and moved into J2N (right now it exists in `ICU4N.Support`, which is meant to go away from the public API). `CharacterEnumerator` was created for this purpose, but it had to be commented out because I couldn't get it working with the Lucene.NET components, although it worked fine in ICU4N. That change will definitely break the public API. I don't think there is anything else that will break it, though.
   
   > I'm reading into that [ICU4N](https://github.com/NightOwl888/ICU4N), 
which Lucene.NET.ICU depends on, is also probably a work in progress. And it's 
certainly worth noting that ICU support is something the Java Lucene team got 
for free in the JDK that unfortunately isn't included in the .NET Framework 
(full or core). Hence the need to create ICU4N to provide that support. A 
nontrivial endeavor in its own right.
   
   Yes, ICU4N is still a work in progress. There are several tests that still fail, often due to gaps in the port that we haven't yet covered. There are also some concurrency bugs to track down. Since it is only a partial port, there are also lots of tests left to go through that might still be able to be ported. The intention is not to port any more of the production code (except perhaps some of the formatters and parsers, because that is where most of its funding has come from so far).
   
   The ICU4J functionality is not in the JDK; rather, ICU4N is a port of ICU4J. It is hard to integrate because the gap between Java and ICU4J is not the same as the gap between .NET and ICU4N, although it is made easier by the fact that ICU is documented pretty well.
   
   In short, ICU4N and ICU4J extend the text-processing capabilities of .NET and Java by providing rules-based versions of some of the built-in components (such as the .NET `CompareInfo` class, which corresponds to the more powerful `RuleBasedCollator` in ICU4N). These components allow you to control the behavior in custom ways that simply can't be done on the raw .NET or JDK platforms. There are also many other features that are extremely valuable, such as `UnicodeSet`, which can be used like a regex character class but is much more powerful (it can even be passed a string to match all of the characters in a specific version of Unicode).
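
   To make that concrete, here is a minimal sketch of what that looks like from user code. It assumes the `ICU4N.Text` namespace and that the member names mirror the ICU4J API that ICU4N ports, so treat it as illustrative rather than exact.

```csharp
// Minimal sketch -- assumes ICU4N's ICU4N.Text namespace; member names are
// based on the ICU4J API that ICU4N ports and may differ slightly by version.
using System;
using ICU4N.Text;

public static class Icu4NDemo
{
    public static void Main()
    {
        // UnicodeSet works like a regex character class, but with full access
        // to Unicode properties.
        var letters = new UnicodeSet("[:Letter:]");
        Console.WriteLine(letters.Contains('é')); // True
        Console.WriteLine(letters.Contains('3')); // False

        // A rule-based collator lets you tailor the sort order with custom
        // rules, which CompareInfo in .NET cannot do. Here "ch" is treated
        // as its own letter that sorts after "c".
        var collator = new RuleBasedCollator("&c < ch");
        Console.WriteLine(collator.Compare("chico", "cz")); // positive: "chico" sorts after "cz"
    }
}
```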
   
   We use the ICU4N `BreakIterator` in all cases where the JDK `BreakIterator` is required because .NET completely lacks this feature (even though .NET itself depends on ICU now, the API for this is not exposed anywhere). This has also caused some compatibility issues because of differences between how ICU4J and the JDK behave, so we had to patch the `ThaiAnalyzer` and basically write our own tests for some of the highlighters. Unfortunately, the highlighters won't behave exactly the same unless we do the research to work out what to recommend as the "JDK format" by providing custom rules that correspond to the Java behavior.
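
   For anyone who hasn't used it, a word-segmentation sketch looks roughly like this (the `UCultureInfo` overload and the `Done` constant are assumptions based on how ICU4N typically .NETifies the ICU4J names):

```csharp
// Minimal sketch -- the UCultureInfo overload and the Done constant are
// assumptions based on how ICU4N typically .NETifies the ICU4J names.
using System;
using ICU4N.Globalization;
using ICU4N.Text;

public static class WordBoundaryDemo
{
    public static void Main()
    {
        string text = "弄清楚如何分解中文單字是很困難的。";

        BreakIterator words = BreakIterator.GetWordInstance(new UCultureInfo("zh"));
        words.SetText(text);

        // Walk the word boundaries; for CJK text this uses ICU's
        // dictionary-based segmentation, which System.Globalization lacks.
        int start = words.First();
        for (int end = words.Next(); end != BreakIterator.Done; start = end, end = words.Next())
        {
            Console.WriteLine(text.Substring(start, end - start));
        }
    }
}
```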
   
   > But when I review the docs for [Lucene.Net.ICU](https://lucenenet.apache.org/docs/4.8.0-beta00016/api/icu/overview.html) and see what's included, it feels very central to a search library and encompasses such basic functionality as finding word boundaries and line break boundaries. While this seems trivial in languages like English, it's anything but trivial in languages like Chinese 要弄清楚如何分解中文單字是很困難的。("It is very hard to figure out how to split Chinese words.") or Japanese 中国語で単語を区切る方法を理解するのは難しいです ("It is hard to understand how to separate words in Chinese.").
   >
   > Given that a great many of the developers using Lucene.NET only use it for 
English text, or other languages that use the Latin alphabet, it's easy to see 
how we can sometimes lose sight of what ICU is and why it's so important. Based 
on your post, I now better understand why Lucene.NET hasn't had a public 
release yet. Still, it seems very unfortunate that such a stable product (at 
least for indexing Latin languages) has a current version (beta) that doesn't 
indicate it's production-ready for Latin languages.
   
   Actually, there are several use cases that make it valuable even for Western European languages, for example, removing diacritics from words. In .NET, this cannot be done without [a hack](https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net) because the normalization feature is missing the case folding option that ICU has. I have seen many people post this hack in their questions about Lucene.NET even though they could just use the `ICUFoldingFilter` or `ICUNormalizer2Filter` instead.

   These filters make words with accented characters, such as resume, résumé, and resumé, all normalize to the same root term for searching.
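
   As a rough sketch (assuming the `Lucene.Net.Analysis.Icu` namespace from the `Lucene.Net.ICU` package and the `Analyzer.NewAnonymous` convenience; exact member names may differ between betas), wiring `ICUFoldingFilter` into an analyzer looks like this:

```csharp
// Minimal sketch -- assumes the Lucene.Net.Analysis.Icu namespace from the
// Lucene.Net.ICU package and the Analyzer.NewAnonymous helper; exact member
// names may differ between betas.
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public static class FoldingAnalyzerFactory
{
    public static Analyzer Create()
    {
        return Analyzer.NewAnonymous((fieldName, reader) =>
        {
            // Tokenize, then fold case and diacritics so that "resume",
            // "résumé", and "resumé" all become the same token.
            Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
            TokenStream stream = new ICUFoldingFilter(tokenizer);
            return new TokenStreamComponents(tokenizer, stream);
        });
    }
}
```

   With the same analyzer applied at both index and query time, those three spellings all match the same term.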
   
   Although the components inside the `Lucene.Net.ICU` assembly are valuable as-is, the real value is in using ICU4N to build custom analysis components, for example, something along the lines of the sketch below.
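
   This is a hypothetical filter that drives ICU4N directly. The `Transliterator.GetInstance` call and the compound transform ID follow the ICU4J API that ICU4N ports; it is a sketch of the idea, not production code.

```csharp
// Hypothetical sketch -- a custom TokenFilter that drives ICU4N directly.
// Transliterator.GetInstance and the compound transform ID follow the ICU4J
// API that ICU4N ports; field/attribute names follow Lucene.NET 4.8 conventions.
using ICU4N.Text;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

public sealed class RemoveDiacriticsFilter : TokenFilter
{
    // Compound transform: decompose, strip combining marks, recompose.
    private static readonly Transliterator s_transform =
        Transliterator.GetInstance("NFD; [:Nonspacing Mark:] Remove; NFC");

    private readonly ICharTermAttribute termAtt;

    public RemoveDiacriticsFilter(TokenStream input)
        : base(input)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken())
            return false;

        // Replace the term text with its de-accented form.
        termAtt.SetEmpty().Append(s_transform.Transliterate(termAtt.ToString()));
        return true;
    }
}
```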

