Re: numericValue for (unicode) characters

Dmitry Olshansky Thu, 10 Jan 2013 10:10:25 -0800

10-Jan-2013 03:21, H. S. Teoh wrote:

On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:

On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:

[...]
I, for one, would love to know why isNumeric != hasNumericValue.

[...]

I guess it's just bad wording from the standard.


The standard defined 3 groups that make up Number:
[Nd]    Number, Decimal Digit
[Nl]    Number, Letter
[No]    Number, Other

However, there are a couple of characters that *are* numbers, but
aren't in those groups.

The "Good" news is that the standard *does* define number_types to
classify the kind of number a char is:
* Null: Not a number
* Digit: Obvious
* Decimal: Any decimal number that is NOT a digit
* Numeric: Everything else.

So they used "Numeric" as wild, and "Number" as their general
category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "numeric", although you could
consider it "has a numeric value".

I hope that makes sense.


Hmph. I guess we need to differentiate between the unicode category
called "numeric", and the property of having a numerical value. So we'd
need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
what the standard is, then that's what it is.


isNumber - _Number_ General category (as defined by Unicode 1:1)

isNumeric - as having NumericType != None (again going by the definition of Unicode properties)


And that's all, correct and to the letter.
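
A minimal sketch of that distinction, written in Python only because its standard unicodedata module exposes both properties directly; it is not std.uni code. The helper names is_number and is_numeric are hypothetical stand-ins for isNumber and isNumeric, and treating "unicodedata.numeric returns a value" as NumericType != None is an assumption about how Python's copy of the database is populated:

```python
import unicodedata

def is_number(ch):
    """General category is one of Nd, Nl, No (the Number category)."""
    return unicodedata.category(ch).startswith("N")

def is_numeric(ch):
    """Character carries a numeric value (approximates NumericType != None)."""
    return unicodedata.numeric(ch, None) is not None

# Print code point, general category and both predicates for a few samples.
for ch in "5Ⅷ½五a":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_number(ch), is_numeric(ch))
```

U+4E94 (五) is the interesting case: its general category is Lo, so it is not a Number, yet it has the numeric value 5.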


Anyway, I'd love to see std.uni cover all unicode categories.

Offhanded note: should we unify the various isX() functions into:

        bool inCategory(string category)(dchar ch)

No, no, no! It's a horrible idea. The main problem with it is that a huge catalog of data has to be stored in Phobos (object code) with no (even niche) use. Also, to be practical for use cases other than casual observation it has to be fast, and it can't be for any of the useful cases.

Just count the number of bits to store per codepoint and consider the fairly irregular structure of the whole set of properties (unlike individual combinations, which do have a nice distribution, e.g. Scripts as in Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, reading through the TR-xx algorithms, and *none* of them requires queries of the sort that test all (more than 1-2?) of the properties.

In all cases the algorithm itself defines a set (or sets) of codepoints with different meanings/values for this use case. These (useful) sets could be compressed to a fast multi-stage table; the whole catalog of properties could not, as it packs enormous heaps of unused junk (Unicode_Age anyone??). This junk is not fit for the std library, but the goal is to provide a tool for the user to work with sets/data beyond the commonly useful ones in std.
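
To make "multi-stage table" concrete, here is a rough two-stage sketch in Python for a single set (the Nd codepoints). The block size of 256 and the names build_two_stage and lookup are assumptions chosen for illustration; this is not how std.uni lays out its tables:

```python
import sys
import unicodedata

BLOCK = 256  # code points per second-stage block (an arbitrary choice)

def build_two_stage(predicate):
    """Compress a code point set into an index table plus deduplicated blocks."""
    stage1 = []   # one entry per block: index into `blocks`
    blocks = []   # unique blocks, each a tuple of BLOCK booleans
    seen = {}     # block contents -> position in `blocks`
    for start in range(0, sys.maxunicode + 1, BLOCK):
        block = tuple(predicate(chr(cp)) for cp in range(start, start + BLOCK))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        stage1.append(seen[block])
    return stage1, blocks

def lookup(table, ch):
    """Two array reads: pick the block, then the entry inside it."""
    stage1, blocks = table
    cp = ord(ch)
    return blocks[stage1[cp // BLOCK]][cp % BLOCK]

def is_decimal_digit(ch):
    return unicodedata.category(ch) == "Nd"

ND_TABLE = build_two_stage(is_decimal_digit)
print(len(ND_TABLE[0]), "blocks in stage 1,", len(ND_TABLE[1]), "unique blocks stored")
```

Most blocks contain no member of the set at all, so they collapse into one shared all-false block. That sharing is what keeps a single well-distributed set small, and it is exactly what gets lost when every property is packed into one record per codepoint.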

where category is the Unicode designation, say "Nl", "Nd", etc.? That
way, it's more future-proof in case the Unicode guys add more
categories.
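
For illustration only, the proposed query can be sketched like this in Python (a D version would be a template over std.uni data, which this does not attempt to reproduce). The name in_category is hypothetical, and matching on a prefix so that "N" covers Nd, Nl and No is an added assumption, not part of the proposal:

```python
import unicodedata

def in_category(category, ch):
    """True if ch's general category equals `category` or starts with it."""
    return unicodedata.category(ch).startswith(category)

print(in_category("Nd", "5"), in_category("Nl", "5"), in_category("N", "Ⅷ"))
```

Note that this is cheap to write only because the interpreter already ships the whole general-category table, which is the storage cost objected to above.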


I'm posting my work on std.uni as ready for review today or tomorrow.

It includes a type for a set of codepoints and a ton of predefined sets for Nl, Nd and almost everything sensible (blocks, scripts, properties).

The user can then conjure whatever combination is required.
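
As a rough picture of what "conjuring a combination" means, here is a sketch that uses plain Python sets of code points as a stand-in for the codepoint set type. The names build_category_sets and CATEGORY are hypothetical, and nothing here reflects the actual std.uni interface:

```python
import sys
import unicodedata
from collections import defaultdict

def build_category_sets():
    """Group every code point by its general category in a single pass."""
    sets = defaultdict(set)
    for cp in range(sys.maxunicode + 1):
        sets[unicodedata.category(chr(cp))].add(cp)
    return sets

CATEGORY = build_category_sets()

# Combine the predefined sets with ordinary set algebra.
digits_and_letter_numbers = CATEGORY["Nd"] | CATEGORY["Nl"]
all_numbers = digits_and_letter_numbers | CATEGORY["No"]
non_ascii_digits = CATEGORY["Nd"] - set(range(0x80))

print(ord("Ⅷ") in digits_and_letter_numbers, ord("½") in all_numbers)
```

Only the sets a program actually names need to be stored, which is the size argument made here against a full property database.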

And it is still way smaller than having the full 'query the database' thing. To check the full madness of all of the properties, just use the web interface of unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they are, after all, also part of the Unicode standard (and character database).


--
Dmitry Olshansky
