10-Jan-2013 03:21, H. S. Teoh wrote:
On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
[...]
I, for one, would love to know why isNumeric != hasNumericValue.
[...]
I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd]    Number, Decimal Digit
[Nl]    Number, Letter
[No]    Number, Other

However, there are a couple of characters that *are* numbers, but
aren't in those groups.

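The "good" news is that the standard *does* define a Numeric_Type to
classify the kind of number a char is:
* None: Not a number
* Decimal: A decimal digit usable in positional notation ('0'-'9' and their counterparts in other scripts)
* Digit: A digit that is not used positionally (e.g. superscript digits)
* Numeric: Everything else that has a numeric value.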

So they used "Numeric" as the catch-all type, and "Number" as their general
category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "numeric", although you could
consider it "has a numeric value".

I hope that makes sense.

Hmph. I guess we need to differentiate between the unicode category
called "numeric", and the property of having a numerical value. So we'd
need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
what the standard is, then that's what it is.

isNumber - _Number_ General category (as defined by Unicode 1:1)

isNumeric - as having NumericType != None (again going by the definition of Unicode properties)

And that's all, correct and to the letter.
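
To make the difference between the two definitions concrete, here is a minimal sketch. It is written in Python rather than D, purely for illustration, and uses the standard unicodedata module; the function names is_number and is_numeric are hypothetical stand-ins for the proposed isNumber and isNumeric, not the actual std.uni API.

```python
import unicodedata

def is_number(ch: str) -> bool:
    """True if the General Category is one of Nd, Nl, No."""
    return unicodedata.category(ch) in ("Nd", "Nl", "No")

def is_numeric(ch: str) -> bool:
    """True if the character has a numeric value (Numeric_Type != None)."""
    # unicodedata.numeric returns the default when no value is defined.
    return unicodedata.numeric(ch, None) is not None

# '5' is a decimal digit: it is a Number and it has a numeric value.
# U+4E94 (CJK ideograph "five") is a Letter (Lo), yet it has the value 5.
samples = ["5", "\u2167", "\u00bd", "\u4e94", "a"]
results = {ch: (is_number(ch), is_numeric(ch)) for ch in samples}
```

The CJK ideograph is the kind of character where the two predicates disagree: it is outside the Number category but still carries a numeric value.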


Anyway, I'd love to see std.uni cover all unicode categories.

Offhanded note: should we unify the various isX() functions into:

        bool inCategory(string category)(dchar ch)


No, no, no! It's a horrible idea. The main problem with it is that a huge catalog of data has to be stored in Phobos (object code) that is of no use, not even niche use. Also, to be practical for use cases other than casual observation it has to be fast, and it can't be for any of the useful cases.

Just count the number of bits to store per codepoint, and consider the fairly irregular structure of the whole set of properties (unlike individual combinations, which do have a nice distribution, e.g. Scripts as in Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, reading through the TR-xx algorithms, and *none* of them requires queries that test all (or even more than 1-2) of the properties.

In all cases the algorithm itself defines a set (or sets) of codepoints with different meanings/values for its use case. These (useful) sets can be compressed into a fast multi-stage table; the whole catalog of properties cannot, as it packs enormous heaps of unused junk (Unicode_Age, anyone?). This junk is not fit for the std library; the goal is instead to give the user a tool for working with sets/data beyond what is commonly useful in std.
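
For readers unfamiliar with the technique, a multi-stage table for a single set can be sketched roughly as follows. This is an illustrative Python sketch under my own assumptions (a two-stage layout with 256-codepoint blocks, and the names build_two_stage and lookup are hypothetical); it is not the layout std.uni actually uses.

```python
import unicodedata

BLOCK_BITS = 8                 # assumed block size: 256 codepoints
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CODEPOINT = 0x110000       # one past the last Unicode codepoint

def build_two_stage(pred):
    """Build (stage1, blocks) for the set of codepoints where pred is true.

    stage1 maps a block number to an index into blocks; identical blocks
    are stored only once, which is where the compression comes from.
    """
    stage1, blocks, seen = [], [], {}
    for start in range(0, MAX_CODEPOINT, BLOCK_SIZE):
        block = tuple(pred(chr(cp)) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        stage1.append(seen[block])
    return stage1, blocks

def lookup(table, cp):
    """Membership test: two array indexings, no searching."""
    stage1, blocks = table
    return blocks[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]

# Example set: General Category Nd (decimal digits).
nd_table = build_two_stage(lambda ch: unicodedata.category(ch) == "Nd")
```

Because most 256-codepoint blocks contain no decimal digits at all, they collapse into a single shared all-false block, so the second stage stays small. A full property catalog would not share blocks nearly as well.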

where category is the Unicode designation, say "Nl", "Nd", etc.? That
way, it's more future-proof in case the Unicode guys add more
categories.

I'm posting my work on std.uni as ready for review today or tomorrow.
It includes a type for a set of codepoints and a ton of predefined sets for Nl, Nd and almost everything sensible (blocks, scripts, properties).
The user can then conjure whatever combination required.
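
As a rough idea of what "a type for a set of codepoints" plus user-made combinations might look like, here is a minimal Python sketch that stores a set as sorted, non-overlapping half-open ranges. The class name, its operators, and the hand-picked ranges below are assumptions for illustration only; they are not the interface or the data of the std.uni submission.

```python
import bisect

class CodepointSet:
    """A set of codepoints kept as sorted, merged half-open [lo, hi) ranges."""

    def __init__(self, intervals=()):
        merged = []
        for lo, hi in sorted(intervals):
            if merged and lo <= merged[-1][1]:      # overlapping or adjacent
                merged[-1][1] = max(merged[-1][1], hi)
            else:
                merged.append([lo, hi])
        self.intervals = [tuple(r) for r in merged]
        self._starts = [lo for lo, _ in self.intervals]

    def __contains__(self, cp):
        # Binary search for the last range starting at or before cp.
        i = bisect.bisect_right(self._starts, cp) - 1
        return i >= 0 and cp < self.intervals[i][1]

    def __or__(self, other):                        # union
        return CodepointSet(self.intervals + other.intervals)

    def __sub__(self, other):                       # difference
        out = []
        for lo, hi in self.intervals:
            for olo, ohi in other.intervals:
                if ohi <= lo or olo >= hi:
                    continue                        # no overlap
                if olo > lo:
                    out.append((lo, olo))
                lo = max(lo, ohi)
                if lo >= hi:
                    break
            if lo < hi:
                out.append((lo, hi))
        return CodepointSet(out)

# Hand-picked example ranges (not the real predefined sets):
ascii_digits = CodepointSet([(0x30, 0x3A)])         # '0'..'9'
arabic_indic = CodepointSet([(0x660, 0x66A)])       # U+0660..U+0669
digits = ascii_digits | arabic_indic
non_ascii_digits = digits - ascii_digits
```

Combining predefined sets this way only costs storage for the ranges actually used, which is the contrast with a generic query over the full property database.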

And it's still way smaller than having the full 'query the database' thing. To see the full madness of all the properties, just use the web interface at unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they too are, after all, part of the Unicode standard (and the character database).

--
Dmitry Olshansky
