Re: numericValue for (unicode) characters

monarch_dodra Fri, 04 Jan 2013 12:55:14 -0800

On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:

04-Jan-2013 21:48, monarch_dodra wrote:
I finished an implementation:

https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.
Well, for a start it features tons of code duplication. But I'm replacing the whole std.uni anyway...


Well, I wrote that with duplication, keeping in mind you would
probably replace both. I thought it would be cleaner to have some
duplication than a warped single implementation. I could also
make the extra effort. I was really concerned with first having
an implementation that is Unicode correct.

I also thought that, at worst, you could use my parsed data ;) to
submit your own (superior?) pull.

* There's a couple of characters in tableLo that have numeric values. These aren't considered in isNumber either. I think this might be a bug though.
* There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC SIGN". These return wild values, and in particular two of them return -1. I *think* this should actually return nan for us, because (AFAIK) -1 is just a wild value for invalid :/
Some have a numeric value of '-1', I think. The truth of the matter is that, as usual with Unicode, things are rather complicated. So 'numeric character' is a (general) category and 'has numeric value' is some other property of a codepoint that may or may not correlate directly with the category.
Thus I think (looking ahead into your other post) that isNumber is correct, as it follows its documented behavior.
Maybe we should just return -1 on invalid unicode? Or maybe it's just my input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?
Nope. Does letter 'A' return a wild number?


Well, the thing is that I'm getting contradictory info from the
consortium itself:
Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
According to the "UnicodeData.txt", its numeric value is -1.
According to the "Unicode utilities", it is not a numeric type,
and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric
value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt:
They really seem like 4 entries in Nl that aren't numbers.

I've found a couple people on internet discussing this, but no
hard conclusion :/
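
To make the nan-versus--1 question concrete, here is a minimal
sketch in Python (for illustration only; it is not the code in the
pull request). It assumes the usual UnicodeData.txt layout, where
the numeric value is the ninth semicolon-separated field and may be
empty or a fraction such as 1/2. The function name, the
negative_is_invalid flag and the three sample records are
hypothetical; the U+12456 record carries the -1 described above.

```python
import math
from fractions import Fraction

# Index of the numeric-value field in a UnicodeData.txt record (assumed layout).
NUMERIC_FIELD = 8

def numeric_value(record, negative_is_invalid=False):
    """Return the numeric value of one record, or NaN when there is none.

    negative_is_invalid is a hypothetical switch: when set, a negative
    entry such as -1 is treated as "no numeric value" and mapped to NaN.
    """
    raw = record.split(";")[NUMERIC_FIELD]
    if not raw:
        return math.nan              # e.g. a plain letter: field is empty
    value = float(Fraction(raw))     # handles both "3" and "1/2"
    if negative_is_invalid and value < 0:
        return math.nan
    return value

# Illustrative records written in the UnicodeData.txt format.
LETTER_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
ONE_HALF = ("00BD;VULGAR FRACTION ONE HALF;No;0;ON;"
            "<compat> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;")
NIGIDAMIN = "12456;CUNEIFORM NUMERIC SIGN NIGIDAMIN;Nl;0;L;;;;-1;N;;;;;"

if __name__ == "__main__":
    for rec in (LETTER_A, ONE_HALF, NIGIDAMIN):
        print(rec.split(";")[1], numeric_value(rec),
              numeric_value(rec, negative_is_invalid=True))
```

With the flag off, the cuneiform record reports the -1 exactly as
the file states it; with the flag on, it reports nan like any
character without a numeric value.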

****

Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Maybe isNumber's "documented behavior" is wrong?
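
The gap between "is in Nd/No/Nl" and "has a numeric value" can be
checked with a small sketch, again in Python for illustration
only. It relies on Python's standard unicodedata module, so the
answers depend on the Unicode version bundled with the
interpreter; the helper names are hypothetical and say nothing
about how std.uni implements isNumber.

```python
import unicodedata

# General categories that make up "Number" (N*).
NUMBER_CATEGORIES = {"Nd", "Nl", "No"}

def is_number_by_category(ch):
    """True if the character's general category is Nd, Nl or No."""
    return unicodedata.category(ch) in NUMBER_CATEGORIES

def has_numeric_value(ch):
    """True if the database assigns the character a numeric value."""
    return unicodedata.numeric(ch, None) is not None

def describe(ch):
    """Return (category, number-by-category, has-numeric-value)."""
    return (unicodedata.category(ch),
            is_number_by_category(ch),
            has_numeric_value(ch))

if __name__ == "__main__":
    for ch in ("A", "5", "\uF96B"):
        print("U+%04X" % ord(ch), describe(ch))
```

For U+F96B the two predicates disagree: the category is Lo, so a
category-based isNumber says no, while the numeric-value lookup
says yes.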
