Re: [Python-Dev] Regular expressions, Unicode etc.

Nick Maclaren Thu, 09 Aug 2007 02:15:31 -0700

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> There is no problem for isalnum: it will just go away if
> byte-oriented characters go away. Fortunately, we have a
> replacement for the Unicode case.


As we do for isprint.

> The relevance is that your specification of "printing character"
> as "isprint returns true" is nearly useless, as it only applies
> to byte-oriented characters.

Eh?  That's ALL I used it to specify!  I used a Unicode-based
specification for Unicode.

> Unicode-isalnum is defined as isalpha|isdecimal|isdigit|isnumeric.
> isalpha means categories Ll, Lu, Lt, Lo, Lm. isdecimal means
> character has the decimal property. isdigit means the character has
> the digit property. isnumeric means the character has the numeric
> property.

I sincerely hope it isn't!

Using a mixture of categories and properties is truly horrible,
because it isn't unlikely that some future version of Unicode will
introduce anomalies, even if there aren't any there already.  And
the character aliases file doesn't include any properties called
'digit' or 'decimal' or anything much like them, so they need a
painful amount of reverse engineering to determine what characters
they bind to.  It LOOKS as if they are the subcategories, which
would be OK.

A much cleaner and more future-proof specification would be any
category beginning with 'L' or 'N'.  For example, Unicode doesn't
CURRENTLY have a category for indeterminate numbers or sacred
case, such as are used in some languages, but it isn't implausible
that it would add them :-)
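
To make the category-based rule concrete, here is a minimal sketch in
Python using the standard unicodedata module.  The helper name
is_alnum_by_category is hypothetical and is not an existing or proposed
API; it only illustrates "any category beginning with 'L' or 'N'".

```python
import unicodedata


def is_alnum_by_category(ch):
    """Hypothetical helper: true when the general category of the
    single character ch begins with 'L' (letter) or 'N' (number)."""
    # unicodedata.category returns a two-letter code such as 'Lu' or 'Nd';
    # the first letter is the major class.
    return unicodedata.category(ch)[0] in ("L", "N")


if __name__ == "__main__":
    for ch in ("a", "5", "\u2167", " ", "_"):
        print(repr(ch), unicodedata.category(ch), is_alnum_by_category(ch))
```

Because the test looks only at the first letter of the category, any
new 'L' or 'N' subcategory added by a later Unicode version would be
picked up without changing the rule.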

> It was a proposal for a definition. English is not my native
> language, and "printing character" means nothing to me. So
> I kindly asked for a definition, and suggested one possibility.
> I would not have guessed that you consider white-space characters
> as "printing", as they don't actually print anything.

Ah.  It's not an ordinary English term.  It's a computer language
one, so I assumed that you would know it.

It is older than C, but C standardised its use to mean any of the
characters which are intended to display (or leave a blank) with
standard, single positioning semantics.  Almost all languages
derived from C use it in the same sense, and Python has a fair
amount of C ancestry.

> The problem is that you did not quite mention a rule, or else
> I missed it.

I did, and you did!  I said that it should be any character with
a defined category that is not 'control'.

> You seem to be asking for being able to express "not a control
> character". I propose that this is best done with UTS#18,
> in which you would write
> 
>   [\P{C}] # or \P{Other}
>
> If this is what you want, I'm all in favor of having it
> implemented.

Excellent!  We are agreed.  Yes, that is equivalent.
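
Python's standard re module does not accept \P{C}, so the agreed rule
can only be sketched here as a per-character test with unicodedata.
The name is_printing is hypothetical, chosen purely for illustration;
it treats every category in the 'C' (Other) class as non-printing,
which is what \P{C} excludes.

```python
import unicodedata


def is_printing(ch):
    """Hypothetical helper: true when the single character ch is not
    in any 'C' (Other) category, i.e. it would match \\P{C}."""
    # The 'C' class covers Cc (control), Cf (format), Cs (surrogate),
    # Co (private use) and Cn (unassigned).
    return not unicodedata.category(ch).startswith("C")


if __name__ == "__main__":
    for ch in ("a", " ", "\n", "\x7f", "\u200b"):
        print(repr(ch), unicodedata.category(ch), is_printing(ch))
```

Under this sketch a space counts as printing, since its category is
Zs, while newline and DEL do not.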

I am NOT volunteering to add the support of that to the parser,
especially now I have discovered the format of the intermediate
data :-(  It would be a foul task, and it isn't clear what syntax
to use, anyway.

There is the horrible POSIX syntax, which I blame (perhaps wrongly)
on HP-UX, and the Java one, which I believe is a modified subset
of the example in UTS#18.  But that says:

    All syntax and API presented in this document is only for the
    purpose of illustration; there is absolutely no requirement to
    follow such syntax or API.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679
