"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> There is no problem for isalnum: it will just go away if
> byte-oriented characters go away. Fortunately, we have a
> replacement for the Unicode case.

As we do for isprint.

> The relevance is that your specification of "printing character"
> as "isprint returns true" is nearly useless, as it only applies
> to byte-oriented characters.

Eh? That's ALL I used it to specify! I used a Unicode-based
specification for Unicode.

> Unicode-isalnum is defined as isalpha|isdecimal|isdigit|isnumeric.
> isalpha means categories Ll, Lu, Lt, Lo, Lm. isdecimal means
> character has the decimal property. isdigit means the character has
> the digit property. isnumeric means the character has the numeric
> property.

I sincerely hope it isn't! Using a mixture of categories and
properties is truly horrible, because it isn't unlikely that some
future version of Unicode will introduce anomalies, even if there
aren't any there already. And the character aliases file doesn't
include any properties called 'digit' or 'decimal' or anything much
like them, so they need a painful amount of reverse engineering to
determine what characters they bind to. It LOOKS as if they are the
subcategories, which would be OK.

A much cleaner and more future-proof specification would be any
category beginning with 'L' or 'N'. For example, Unicode doesn't
CURRENTLY have a category for indeterminate numbers or sacred case,
such as are used in some languages, but it isn't implausible that it
would add them :-)

> It was a proposal for a definition. English is not my native
> language, and "printing character" means nothing to me. So
> I kindly asked for a definition, and suggested one possibility.
> I would not have guessed that you consider white-space characters
> as "printing", as they don't actually print anything.

Ah. It's not an ordinary English term. It's a computer language one,
so I assumed that you would know it. It is older than C, but C
standardised its use to mean any of the characters which are intended
to display (or leave a blank) with standard, single positioning
semantics.
Almost all languages derived from C use it in the same sense, and
Python has a fair amount of C ancestry.

> The problem is that you did not quite mention a rule, or else
> I missed it.

I did, and you did! I said that it should be any character with a
defined category that is not 'control'.

> You seem to be asking for being able to express "not a control
> character". I propose that this is best done with UTS#18,
> in which you would write
>
> [\P{C}] # or \P{Other}
>
> If this is what you want, I'm all in favor of having it
> implemented.

Excellent! We are agreed. Yes, that is equivalent.

I am NOT volunteering to add the support of that to the parser,
especially now I have discovered the format of the intermediate
data :-( It would be a foul task, and it isn't clear what syntax to
use, anyway. There is the horrible POSIX syntax, which I blame
(perhaps wrongly) on HP-UX, and the Java one, which I believe is a
modified subset of the example in UTS#18. But that says:

    All syntax and API presented in this document is only for the
    purpose of illustration; there is absolutely no requirement to
    follow such syntax or API.

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: [EMAIL PROTECTED]
Tel.: +44 1223 334761 Fax: +44 1223 334679
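
For illustration only, the two category-based rules discussed in this
message can be sketched with Python's standard unicodedata module. This
is a minimal sketch, not anything proposed for the parser: the function
names is_alnum_by_category and is_printing are hypothetical, and the
results depend on the version of the Unicode database shipped with the
interpreter.

```python
import unicodedata


def is_alnum_by_category(ch):
    """Hypothetical helper: general category begins with 'L' or 'N'."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_printing(ch):
    """Hypothetical helper: general category is not one of the 'C' group."""
    # Rough equivalent of the UTS#18-style class [\P{C}]: anything whose
    # category is not Cc, Cf, Cs, Co or Cn counts as "printing".
    return not unicodedata.category(ch).startswith("C")


if __name__ == "__main__":
    # Show the category and both classifications for a few characters.
    for ch in ["a", "5", " ", "\n", "\u00e9", "\u200b"]:
        print(repr(ch), unicodedata.category(ch),
              is_alnum_by_category(ch), is_printing(ch))
```

Both helpers look only at the general category, so they avoid mixing
categories with separate character properties.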