Yes, I got this email, but not the original reply from Christopher - thanks for keeping me in the loop. I added the [email protected] email to my safe senders list a couple of days ago but apparently that wasn't sufficient, so I have added both of your email addresses on this email and Christopher's as well - hopefully that will suffice.
If you have replied to any of my other emails, please forward them to me once again (I didn't get any replies).

And yes, another snag I realized about our current incarnation of ICU is that there is no way to strong name Analysis.Common because ICU4NET is not strong named. If memory serves correctly, it is not possible to strong name assemblies that depend on unmanaged resources - if that is true, it is a strong case to eliminate ICU4NET sooner rather than later. Strong named assemblies cannot depend on assemblies that are not strong named, so it is pretty critical for an open source project to support strong naming for those projects that require it.

I took a look at icu-dotnet and it looks like it has all of the pieces we need for Collation (at first glance anyway), however the "BreakIterator" in there is a static class (that does not even implement IEnumerator) that doesn't do what we need. What we need is an Enumerator that works out how to determine one word from the next in Thai - a language that doesn't use spaces to delineate words. Their "BreakIterator" uses spaces and/or punctuation to determine word breaks. I haven't looked under the hood much on ICU4NET, but I know separating Thai words programmatically is a very complex problem. If there are no other working BreakIterator implementations available for .NET, it seems like the next best option to get a .NET core-compatible working implementation would be to port it from Java.

That being said, in terms of importance the Collation is much higher because it is a cross-culture feature. BreakIterator is used in Lucene for the Thai analyzer and for the text highlighter for all other languages. In a pinch we could get away with breaking on spaces and punctuation for cross-culture support in the highlighter and simply not supporting Thai (either for the highlighter or Analyzer) for .NET core.
This seems like the most reasonable tradeoff given how difficult it will be (or rather how many man hours it will take to get there) to support Thai in .NET core as well as how low on the totem pole Thai is in terms of world languages (and I am sitting in Thailand as I write this).

Perhaps it might even make sense to make the Analysis.Th namespace and other parts that support BreakIterator (such as the text highlighter) for Thai into its own .NET NuGet package for .NET 4.6 so the BreakIterator dependency can be isolated to just that package, and the rest can then compile and deploy in both .NET 4.6 and .NET core with a stripped down version of BreakIterator that works in most languages other than Thai. Frankly, I would personally like to see Lucene.Net 4.8 released before Lucene 7.0 is released rather than having everyone bend over backwards to try to fit Thai language support into .NET core/Azure.

> Is this something that we should wait for so that the migration of the
> Collation namespace is a more direct port, or should we go ahead with
> trying to use the .NET classes? I just want to make sure that we are
> not changing the internal workings of these classes so much that they
> don't work the same as their Java counterparts. The piece that I kept
> getting hung up on was the RuleBasedCollator which icu-dotnet has a
> direct port of (along with Collator and Locale).

I'd say first set a reference to icu-dotnet and see if you can get all of the collator tests to pass. If so, then bring the classes over into our Support namespace. If not, then continue down the path I started - I think there are about 5 or 6 more dependencies that will be required to get all of the Collation pieces from Java (that is, if the Enumerator I mentioned before doesn't work), and there should be some way to plug in your own implementation of RuleBasedCollator on a per-locale basis.
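
For anyone following along, here is a minimal sketch of the Java-side behavior the Collation port would need to reproduce. It uses the JDK's own java.text classes rather than ICU4J (an assumption, made only to keep the example dependency-free), and the class name CollationSketch and its helper methods are hypothetical, not anything that exists in Lucene or Lucene.Net. It shows a locale-based Collator, a RuleBasedCollator built from custom rules, and comparison through CollationKey, which plays roughly the role that SortKey plays in .NET.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or Lucene.Net.
public class CollationSketch {

    /** Builds a collator from explicit rules (the "plug in your own rules" case). */
    public static RuleBasedCollator fromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            // Rethrow unchecked so callers do not have to handle a checked exception.
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    /** Returns a sorted copy of the input, ordered by the given collator. */
    public static String[] sorted(Collator collator, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, collator); // Collator implements Comparator<Object>
        return copy;
    }

    /** Compares two strings through their binary sort keys instead of directly. */
    public static int compareByKey(Collator collator, String left, String right) {
        CollationKey leftKey = collator.getCollationKey(left);
        CollationKey rightKey = collator.getCollationKey(right);
        return leftKey.compareTo(rightKey);
    }

    public static void main(String[] args) {
        // Locale-based collator: the default ordering for English.
        Collator english = Collator.getInstance(Locale.ENGLISH);
        System.out.println(Arrays.toString(sorted(english, "b", "a", "c")));

        // Custom rules that deliberately reverse the order of a, b and c.
        RuleBasedCollator reversed = fromRules("< c < b < a");
        System.out.println(Arrays.toString(sorted(reversed, "a", "b", "c")));

        // Sort keys give the same ordering as a direct compare.
        System.out.println(compareByKey(reversed, "c", "a") < 0);
    }
}
```

Whichever route is taken on the .NET side, a custom rule set like the reversed one here is a handy check that a ported collator honors its rules rather than silently falling back to culture defaults.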
My thought was to port it over as a mostly Java style implementation, get the tests passing, and then start swapping out the pieces like the SortKey and (possibly) subclassing CompareInfo. Either way, I think that bringing/porting the code over into Lucene.Net is a better option than setting a reference to a library so we have better control over how .NET core-compatible the code is and so we don't take on another dependency.

Thanks,
Shad Storhaug/NightOwl888

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Itamar Syn-Hershko
Sent: Tuesday, September 6, 2016 9:10 PM
To: [email protected]
Subject: Re: Collation

Just a heads up - I tried reaching out to Shad privately and mails to him bounce.. hopefully he can see this :)

Collation and ICU both sound quite painful - would love to see us reducing our dependencies on that front, I already got reports of our current ICU deps not playing along with Azure

--
Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Sun, Sep 4, 2016 at 10:30 PM, Christopher Haws <[email protected]> wrote:

> @NightOwl888
>
> No problem. I had a pretty busy week at work so I wasn't able to work
> on it during the week. I came to the same conclusions as you regarding
> CompareInfo, SortKey, and CultureInfo being .NET's closest equivalent
> to Java's Collation and Locale classes.
>
> Something that I did find while looking through the dev mailing list
> is that Connie Yau, from Microsoft, has replaced ICU4NET with
> icu-dotnet in their port to .NET Core.
>
> http://mail-archives.apache.org/mod_mbox/lucenenet-dev/201605.mbox/%3CCY1PR0301MB0761AE82FE1401AD03CB36E4B84B0%40CY1PR0301MB0761.namprd03.prod.outlook.com%3E
>
> https://github.com/conniey/lucenenet
>
> Is this something that we should wait for so that the migration of the
> Collation namespace is a more direct port, or should we go ahead with
> trying to use the .NET classes? I just want to make sure that we are
> not changing the internal workings of these classes so much that they
> don't work the same as their Java counterparts. The piece that I kept
> getting hung up on was the RuleBasedCollator which icu-dotnet has a
> direct port of (along with Collator and Locale).
>
> icu-dotnet: https://github.com/sillsdev/icu-dotnet
>
> Let me know what you think.
>
> Thanks!
> Christopher Haws
>
