On Saturday, 13 December 2014 at 15:44:59 UTC, Sean Kelly wrote:
> On Friday, 12 December 2014 at 17:57:41 UTC, Trent Forkert wrote:
>
>> I've looked into writing a binding for ICU recently, but ultimately decided to abandon that idea in favor of writing a replacement for it in D.
>
> Wow... really? You're actually going to write transcoders for all available encodings? Plus the conversion and parsing tools, plus expand our calendar functionality to handle the things it doesn't do now, plus... I mean I'd love it, but the scope of the project can be measured in tens of man-years.

Running down the icu4c API listing:

* Basic Types and Constants - only as needed
* Strings and character iteration - Just use D strings, std.string
* Unicode character properties and names - I think std.uni handles this
* Sets of Unicode Code Points and Strings - ditto
* Codepage conversion - ignoring, at least for now. See below.
* Unicode text compression - again, I think std.uni handles this
* Locales - yes
* Resource Bundles - will offer equivalent functionality, just not identical
* Normalization - std.uni
* Calendars - see below
* Date and time formatting - yes
* Message formatting - yes
* Number formatting / spell-out - yes
* Transliteration - yes, but may be delayed until after initial release
* Bidirectional Algorithm - not at first; is this in std.uni?
* Arabic shaping - not at first; is this in std.uni?
* Collation - delayed until after the initial release, so that the release can ship sooner
* String searching - depends on Collation
* Index characters - depends on Collation
* Text Boundary analysis - depends on Collation
* Regular Expression - use std.regex
* StringPrep - not initially; is this in std.uni?
* IDNA - not initially; is this in Phobos?
* Identifier spoofing and confusability - not initially
* Layout engine - delayed; it looks like ICU is removing this and pointing to another library
* Universal Time Scale - see below
* ICU I/O - use Phobos

Very few of the items above cannot be generated from CLDR data. Of those that cannot, most are RFC-defined algorithms, several of which I believe are already part of Phobos.

If I add codepage conversion, it will likely be in terms of iconv on POSIX and MultiByteToWideChar and friends on Windows. Alternatively, I could "borrow" the IBM CDRA/UCM data the way I'm getting almost everything else from CLDR data.
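To make the second option concrete, here is a rough sketch (in Python, purely for illustration) of table-driven decoding from UCM-style mapping data. It assumes the usual CHARMAP entry shape of `<UXXXX> \xNN |flag`, handles single-byte codepages only, and keeps just the round-trip (`|0`) entries. The names `load_ucm`, `decode`, and `SAMPLE` are made up for this example, and the sample is a three-entry fragment, not a real codepage file.

```python
import re

# One CHARMAP entry, e.g. "<U00E9> \xE9 |0".
# Only |0 (round-trip) entries are kept; fallback flags are skipped.
_ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)"
)

# Hypothetical three-entry fragment in UCM style (values match windows-1252).
SAMPLE = r"""
CHARMAP
<U0041> \x41 |0
<U00E9> \xE9 |0
<U20AC> \x80 |0
END CHARMAP
"""


def load_ucm(text):
    """Build a {byte sequence: code point} table from UCM-style text."""
    table = {}
    in_charmap = False
    for line in text.splitlines():
        line = line.strip()
        if line == "CHARMAP":
            in_charmap = True
        elif line == "END CHARMAP":
            in_charmap = False
        elif in_charmap:
            m = _ENTRY.match(line)
            if m and m.group(3) == "0":
                raw = bytes.fromhex(m.group(2).replace("\\x", ""))
                table[raw] = int(m.group(1), 16)
    return table


def decode(data, table):
    """Decode single-byte data; unmapped bytes become U+FFFD."""
    return "".join(chr(table.get(bytes([b]), 0xFFFD)) for b in data)
```

With the sample table, `decode(b"A\xe9\x80", load_ucm(SAMPLE))` gives `"Aé€"`. Multi-byte codepages would additionally need the state-table information that UCM files carry, which this sketch ignores.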

Support for other calendar systems is up in the air at the moment. I had thought CLDR contained what I needed, but it looks like it might not. It has locale-specific formatting and display info for calendars, and mappings to when other calendars' eras begin in terms of the Gregorian calendar, but I don't see any further breakdown of that information. So initially it looks like I'll only be supporting the Gregorian calendar, but I may add the others in the future.
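As a rough illustration of how far era-start mappings alone can go, the sketch below (Python, for illustration only) resolves a Gregorian date to an era and a year within that era. The two era start dates are hand-copied for the example rather than read from CLDR, and `JAPANESE_ERAS` and `japanese_era` are hypothetical names. This is enough for a calendar whose months and days coincide with the Gregorian ones, as the Japanese calendar's do; a calendar with its own month and year lengths would need arithmetic that era start dates do not provide.

```python
from datetime import date

# Era start dates in Gregorian terms, oldest first. Hand-copied for
# illustration; a real implementation would load these from data files.
JAPANESE_ERAS = [
    ("Showa", date(1926, 12, 25)),
    ("Heisei", date(1989, 1, 8)),
]


def japanese_era(d):
    """Return (era name, year within era) for a Gregorian date.

    Returns None for dates before the earliest era in the table.
    """
    result = None
    for name, start in JAPANESE_ERAS:
        if d >= start:
            # Year 1 of an era is the Gregorian year in which it starts.
            result = (name, d.year - start.year + 1)
    return result
```

For example, `japanese_era(date(2014, 12, 13))` returns `("Heisei", 26)`.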

It is a lot of work, yes, but the Unicode Consortium already does a significant chunk of it with CLDR.

 - Trent
