On Saturday, 13 December 2014 at 15:44:59 UTC, Sean Kelly wrote:
> On Friday, 12 December 2014 at 17:57:41 UTC, Trent Forkert wrote:
>> I've looked into writing a binding for ICU recently, but
>> ultimately decided to abandon that idea in favor of writing a
>> replacement for it in D.
>
> Wow... really? You're actually going to write transcoders for
> all available encodings? Plus the conversion and parsing tools,
> plus expand our calendar functionality to handle the things it
> doesn't do now, plus... I mean I'd love it, but the scope of
> the project can be measured in tens of man-years.

Running down the icu4c API listing:
* Basic Types and Constants - only as needed
* Strings and character iteration - just use D strings and std.string
* Unicode character properties and names - I think std.uni
handles this
* Sets of Unicode Code Points and Strings - ditto
* Codepage conversion - ignoring, at least for now. See below.
* Unicode text compression - again, I think std.uni handles this
* Locales - yes
* Resource Bundles - will offer equivalent functionality, just
not identical
* Normalization - std.uni
* Calendars - see below
* Date and time formatting - yes
* Message formatting - yes
* Number formatting / spell-out - yes
* Transliteration - yes, but may be delayed until after initial
release
* Bidirectional Algorithm - not at first, is this in std.uni?
* Arabic shaping - not at first, is this in std.uni?
* Collation - I'm delaying this until after the initial release
to get it out faster
* String searching - depends on Collation
* Index characters - depends on Collation
* Text Boundary analysis - depends on Collation
* Regular Expression - use std.regex
* StringPrep - not initially, is this in std.uni?
* IDNA - not initially, is this in Phobos?
* Identifier spoofing and confusability - not initially
* Layout engine - delayed, looks like ICU is removing this and
pointing to another library
* Universal Time Scale - see below
* ICU I/O - use Phobos
Very few of the items above cannot be generated from CLDR data.
Of those, most are RFC-defined algorithms, several of which I
believe are already part of Phobos.
If I add codepage conversion, it will likely be in terms of iconv
on POSIX and MultiByteToWideChar and friends on Windows.
Alternatively, I could "borrow" the IBM CDRA/UCM data the way I'm
getting almost everything else from CLDR data.
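For the POSIX half of that, a minimal sketch of what delegating to iconv could look like. It is written in C because iconv is a C API (D can call it directly through extern(C)). The helper name to_utf8, the fixed-size output buffer, and the choice of UTF-8 as the target are all assumptions made for illustration; none of this comes from the library being discussed, and the Windows MultiByteToWideChar path is not shown.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any existing library): convert the
 * NUL-terminated string `src` from the codepage named `from` to UTF-8,
 * writing a NUL-terminated result into dst[0 .. dst_size).
 * Returns the number of bytes written (excluding the NUL), or -1 on error. */
long to_utf8(const char *from, const char *src, char *dst, size_t dst_size)
{
    assert(from != NULL && src != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;

    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return -1;                      /* codepage not supported here */

    /* iconv takes char ** for the input but does not modify the bytes. */
    char *in = (char *)src;
    size_t in_left = strlen(src);
    char *out = dst;
    size_t out_left = dst_size - 1;     /* reserve room for the NUL */

    /* Convert the input; fails on invalid input or a full buffer. */
    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);

    /* Flush any pending shift state (matters for stateful encodings). */
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out, &out_left);

    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *out = '\0';
    return (long)(out - dst);
}
```

Called as to_utf8("ISO-8859-1", text, buf, sizeof buf), this leaves the choice of supported codepages entirely to the platform's iconv, which is the trade-off against shipping the UCM tables directly.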
Support for other calendar systems is up in the air at the moment.
I had thought CLDR contained what I needed, but it looks like it
might not. It has locale-specific formatting and display info for
calendars, and mappings of when other calendars' eras begin in
terms of the Gregorian calendar, but I don't see any further
breakdown of information. So initially it looks like I'll only
be supporting the Gregorian calendar, but I may add the others in
the future.
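To illustrate what an era-start mapping alone does give you, here is a rough sketch of an era lookup for a calendar that shares the Gregorian months and days, such as the Japanese one. The struct, the function name era_for, and the abbreviated three-entry table are assumptions for the example, not CLDR's actual data layout. A table like this says nothing about month lengths or leap rules, which is the part other calendar systems would still need.

```c
#include <stddef.h>

/* Hypothetical era record: a name plus the Gregorian date it starts on. */
struct era {
    const char *name;
    int year, month, day;
};

/* Abbreviated, illustrative table of modern Japanese eras. */
static const struct era japanese_eras[] = {
    { "Taisho", 1912,  7, 30 },
    { "Showa",  1926, 12, 25 },
    { "Heisei", 1989,  1,  8 },
};

/* Return the last era whose start is on or before the Gregorian date
 * (y, m, d), or NULL if the date precedes every entry in the table.
 * If era_year is non-NULL, it receives the year within that era. */
const struct era *era_for(int y, int m, int d, int *era_year)
{
    const struct era *found = NULL;
    size_t n = sizeof japanese_eras / sizeof japanese_eras[0];

    for (size_t i = 0; i < n; i++) {
        const struct era *e = &japanese_eras[i];
        /* Lexicographic comparison of (y, m, d) with the era start. */
        int on_or_after =
            y > e->year ||
            (y == e->year &&
             (m > e->month || (m == e->month && d >= e->day)));
        if (on_or_after)
            found = e;      /* table is sorted, so keep the latest match */
    }

    if (found != NULL && era_year != NULL)
        *era_year = y - found->year + 1;    /* era years count from 1 */
    return found;
}
```

For the date of this post, era_for(2014, 12, 13, &n) selects the Heisei entry with n set to 26.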
It is a lot of work, yes, but the Unicode Consortium already does
a significant chunk of it with CLDR.
- Trent