Yes, I got this email, but not the original reply from Christopher - thanks for 
keeping me in the loop. I added the [email protected] email to my safe 
senders list a couple of days ago but apparently that wasn't sufficient, so I 
have added both of your email addresses on this email and Christopher's as well 
- hopefully that will suffice.

If you have replied to any of my other emails, please forward them to me once 
again (I didn't get any replies).

And yes, another snag I noticed with our current incarnation of ICU is that 
there is no way to strong name Analysis.Common, because ICU4NET is not strong 
named and a strong named assembly cannot reference assemblies that are not 
strong named. If memory serves correctly, it is also not possible to strong 
name assemblies that depend on unmanaged resources - if that is true, it is a 
strong case for eliminating ICU4NET sooner rather than later. It is pretty 
critical for an open source project to support strong naming for those 
projects that require it.

I took a look at icu-dotnet and it looks like it has all of the pieces we need 
for Collation (at first glance anyway), however the "BreakIterator" in there is 
a static class (so it does not even implement IEnumerator) that doesn't do 
what we need. What we need is an Enumerator that works out how to determine one 
word from the next in Thai - a language that doesn't use spaces to delineate 
words. Their "BreakIterator" uses spaces and/or punctuation to determine word 
breaks. I haven't looked under the hood much on ICU4NET, but I know separating 
Thai words programmatically is a very complex problem. If there are no other 
working BreakIterator implementations available for .NET, it seems like the 
next best option to get a .NET core-compatible working implementation would be 
to port it from Java.
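
For reference, the Java API in question is java.text.BreakIterator. Below is a 
minimal sketch of how it enumerates word boundaries. The class name 
WordBreakSketch and the helper words() are hypothetical and only for 
illustration. Whether Thai text is segmented correctly depends on the locale 
data shipped with the JDK in use, so the usage shown sticks to English.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene, ICU4NET or icu-dotnet.
class WordBreakSketch {

    // Returns the segments between word boundaries that contain at least one
    // letter or digit, skipping pure whitespace/punctuation segments.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world", Locale.ENGLISH));
    }
}
```

The point of the enumerator shape is that callers only see boundary offsets, 
so a dictionary-based Thai implementation and a simpler one can sit behind the 
same interface.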

That being said, in terms of importance the Collation is much higher because it 
is a cross-culture feature. BreakIterator is used in Lucene for the Thai 
analyzer and for the text highlighter for all other languages. In a pinch we 
could get away with breaking on spaces and punctuation for cross-culture 
support in the highlighter and simply not supporting Thai (either for the 
highlighter or Analyzer) for .NET core. This seems like the most reasonable 
tradeoff given how difficult it will be (or rather how many man hours it will 
take to get there) to support Thai in .NET core as well as how low on the totem 
pole Thai is in terms of world languages (and I am sitting in Thailand as I 
write this). Perhaps it would even make sense to move the Analysis.Th namespace 
and the other parts that need a real BreakIterator for Thai (such as the text 
highlighter) into their own NuGet package targeting .NET 4.6. That way the 
BreakIterator dependency is isolated to just that package, and the rest can 
compile and deploy in both .NET 4.6 and .NET core with a stripped down version 
of BreakIterator that works in most languages other than Thai.
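
As a rough illustration of what such a stripped down fallback could look like, 
here is a sketch that breaks only on spaces and punctuation. It is written in 
Java to match the code being ported. The class name SimpleWordBreaker and the 
method wordSpans() are hypothetical, and this is not how ICU or the JDK 
implement word breaking. It has no notion of Thai word boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fallback for illustration only: treats any run of letters or
// digits as a word and everything else (spaces, punctuation) as a separator.
class SimpleWordBreaker {

    // Returns {start, end} offsets (end exclusive) for each word in the text.
    // Works on UTF-16 chars, so characters outside the BMP are not handled.
    static List<int[]> wordSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a word"
        for (int i = 0; i < text.length(); i++) {
            boolean wordChar = Character.isLetterOrDigit(text.charAt(i));
            if (wordChar && start < 0) {
                start = i;                       // a word begins here
            } else if (!wordChar && start >= 0) {
                spans.add(new int[] {start, i}); // a word ended just before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // word runs to the end
        }
        return spans;
    }
}
```

Returning offsets rather than substrings keeps it usable by a highlighter, 
which needs positions in the original text.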

Frankly, I would personally like to see Lucene.Net 4.8 released before Lucene 
7.0 is released rather than having everyone bend over backwards to try to fit 
Thai language support into .NET core/Azure.

> Is this something that we should wait for so that the migration of the 
> Collation namespace is a more direct port, or should we go ahead with 
> trying to use the .NET classes? I just want to make sure that we are 
> not changing the internal workings of these classes so much that they 
> don't work the same as their Java counterparts. The piece that I kept 
> getting hung up on was the RuleBasedCollator which icu-dotnet has a 
> direct port of (along with Collator and Locale).

I'd say first set a reference to icu-dotnet and see if you can get all of the 
collator tests to pass. If so, then bring the classes over into our Support 
namespace. If not, then continue down the path I started - I think there are 
about 5 or 6 more dependencies that will be required to get all of the 
Collation pieces from Java (that is, if the Enumerator I mentioned before 
doesn't work), and there should be some way to plug in your own implementation 
of RuleBasedCollator on a per-locale basis. My thought was to port it over as a 
mostly Java style implementation, get the tests passing, and then start 
swapping out the pieces like the SortKey and (possibly) subclassing CompareInfo.
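
For context, this is roughly what the Java side looks like with 
java.text.RuleBasedCollator and CollationKey (the closest Java counterpart to 
.NET's SortKey). The sketch is only an illustration: the class CollationSketch, 
its methods and the toy rule string are made up for the example and are not 
taken from Lucene or ICU.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; the rule string is a toy tailoring.
class CollationSketch {

    // Builds a collator whose rules order "c" before "b" before "a".
    static RuleBasedCollator reversedAbc() {
        try {
            return new RuleBasedCollator("< c < b < a");
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    // Returns a sorted copy; Collator implements Comparator, so it can be
    // passed directly to List.sort.
    static List<String> sort(List<String> input, RuleBasedCollator collator) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = reversedAbc();
        System.out.println(sort(List.of("a", "b", "c"), collator));
        // Collation keys compare the same way as the collator itself.
        System.out.println(
            collator.getCollationKey("a").compareTo(collator.getCollationKey("c")) > 0);
    }
}
```

A per-locale plug-in point would then amount to choosing which rule set (or 
which collator implementation) to construct for a given locale.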

Either way, I think that bringing/porting the code over into Lucene.Net is a 
better option than setting a reference to a library so we have better control 
over how .NET core-compatible the code is and so we don't take on another 
dependency.

Thanks,
Shad Storhaug/NightOwl888

-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Itamar Syn-Hershko
Sent: Tuesday, September 6, 2016 9:10 PM
To: [email protected]
Subject: Re: Collation

Just a heads up - I tried reaching out to Shad privately and mails to him 
bounce. Hopefully he can see this :)

Collation and ICU both sound quite painful - I would love to see us reduce our 
dependencies on that front. I have already had reports of our current ICU deps 
not playing along with Azure.

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Sun, Sep 4, 2016 at 10:30 PM, Christopher Haws <[email protected]> wrote:

> @NightOwl888
>
> No problem. I had a pretty busy week at work so I wasn't able to work 
> on it during the week. I came to the same conclusions as you regarding 
> CompareInfo, SortKey, and CultureInfo being .NET's closest equivalent 
> to Java's Collation and Locale classes.
>
> Something that I did find while looking through the dev mailing list 
> is that Connie Yau, from Microsoft, has replaced ICU4NET with 
> icu-dotnet in their port to .NET Core.
>
> http://mail-archives.apache.org/mod_mbox/lucenenet-dev/201605.mbox/%3CCY1PR0301MB0761AE82FE1401AD03CB36E4B84B0%40CY1PR0301MB0761.namprd03.prod.outlook.com%3E
>
> https://github.com/conniey/lucenenet
>
> Is this something that we should wait for so that the migration of the 
> Collation namespace is a more direct port, or should we go ahead with 
> trying to use the .NET classes? I just want to make sure that we are 
> not changing the internal workings of these classes so much that they 
> don't work the same as their Java counterparts. The piece that I kept 
> getting hung up on was the RuleBasedCollator which icu-dotnet has a 
> direct port of (along with Collator and Locale).
>
> icu-dotnet: https://github.com/sillsdev/icu-dotnet
>
> Let me know what you think.
>
> Thanks!
> Christopher Haws
>
