On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
04-Jan-2013 21:48, monarch_dodra wrote:

I finished an implementation:

https://github.com/D-Programming-Language/phobos/pull/1052

It is not "pull ready", so we can still discuss it.


Well, for a start it features tons of code duplication. But I'm replacing the whole std.uni anyway...

Well, I wrote that with duplication, keeping in mind you would
probably replace both. I thought it would be cleaner to have some duplication than a warped single implementation. I could also make the extra effort. I was really concerned with first having an implementation that is Unicode correct.

I also thought that, at worst, you could use my parsed data ;) to submit your own (superior?) pull.

* There are a couple of characters in tableLo that have numeric values. These aren't considered in isNumber either. I think this might be a bug though.
* There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC SIGN". These return wild values, and in particular two of them return -1. I *think* this should actually return nan for us, because (AFAIK) -1 is just a placeholder for invalid :/

Some have a numeric value of '-1', I think. The truth of the matter is that, as usual with Unicode, things are rather complicated. So 'numeric character' is a (general) category, and 'has numeric value' is some other property of a codepoint that may or may not correlate directly with the category.

Thus I think (looking ahead into your other post) that isNumber is correct as it follows its documented behavior.
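
To make the category versus numeric-value distinction concrete, here is a minimal Python sketch (not Phobos code) that reads both fields from UnicodeData.txt-style lines. The helper names are hypothetical, and the sample lines are illustrative entries written in that file's field layout; the -1 for U+12456 is the value reported in this thread.

```python
from fractions import Fraction

# Illustrative entries in the UnicodeData.txt layout (semicolon-separated).
# Field 2 is the general category, field 8 is the numeric value (may be empty).
SAMPLE = """\
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
12456;CUNEIFORM NUMERIC SIGN NIGIDAMIN;Nl;0;L;;;;-1;N;;;;;
"""

NUMBER_CATEGORIES = {"Nd", "Nl", "No"}

def parse(text):
    """Return {code point: (general category, numeric value or None)}."""
    table = {}
    for line in text.splitlines():
        fields = line.split(";")
        code = int(fields[0], 16)
        # The numeric field can hold integers or fractions such as "1/2".
        numeric = Fraction(fields[8]) if fields[8] else None
        table[code] = (fields[2], numeric)
    return table

def is_number(table, code):
    """Category-based test: true only for Nd, Nl and No."""
    return table[code][0] in NUMBER_CATEGORIES

def numeric_value(table, code):
    """Property-based lookup: None when the entry has no numeric value."""
    return table[code][1]

TABLE = parse(SAMPLE)
for code in sorted(TABLE):
    print(f"U+{code:04X}", is_number(TABLE, code), numeric_value(TABLE, code))
```

In this sketch the letter 'A' simply has no numeric value at all, while the cuneiform sign is a number by category and carries whatever the data file says.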


Maybe we should just return -1 on invalid unicode? Or maybe it's just my
input file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
It doesn't have a separate field for isNumber/numericValue, so it is forced to write a wild number. Maybe these four chars should return nan?

Nope. Does letter 'A' return a wild number?


Well, the thing is that I'm getting contradictory info from the
consortium itself:
Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
According to the "UnicodeData.txt", its numeric value is -1.
According to the "Unicode utilities", it is not a numeric type,
and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric
value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt:
They really seem like 4 entries in Nl that aren't numbers.

I've found a couple people on internet discussing this, but no
hard conclusion :/
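
One way to express the "return nan" idea, as a hedged Python sketch rather than a proposal for the actual Phobos code: treat both an empty numeric field and the suspicious -1 as "no numeric value". The function name and the decision to treat -1 as a sentinel are assumptions for illustration; whether -1 really is a data bug is exactly what remains unresolved.

```python
import math
from fractions import Fraction

def numeric_value_or_nan(field):
    """Map the numeric-value field of a UnicodeData.txt entry to a float.

    Assumption: an empty field and the literal "-1" both mean "no value".
    """
    if not field or field == "-1":
        return math.nan
    # Fraction handles plain integers ("3") and fractions ("1/2") alike.
    return float(Fraction(field))

for field in ["", "-1", "3", "1/2"]:
    print(repr(field), numeric_value_or_nan(field))
```

The trade-off is that a genuinely negative numeric value of -1, should one ever be assigned, would be lost under this rule.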

****

Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Maybe isNumber's "documented behavior" is wrong?
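
For reference, such characters can be listed with a short Python sketch that uses the standard unicodedata module (not std.uni). It scans the CJK Compatibility Ideographs range for code points that carry a numeric value although their general category is outside Nd/Nl/No. The function name is hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def numeric_outside_n(start, stop):
    """List (code, category, value) for numeric characters not in N* categories."""
    found = []
    for code in range(start, stop):
        ch = chr(code)
        # The second argument is returned when the character has no numeric value.
        value = unicodedata.numeric(ch, None)
        category = unicodedata.category(ch)
        if value is not None and not category.startswith("N"):
            found.append((code, category, value))
    return found

for code, category, value in numeric_outside_n(0xF900, 0xFB00):
    print(f"U+{code:04X} {category} {value}")
```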
