"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> >> Before discussing the escape, I'd like to see a specification of
> >> it first - what characters precisely would classify as "printing"?
> >
> > For basic ASCII and locale-based testing, whatever isprint() says.
> > Just as for isalpha().
>
> In the mediate term, locale-based testing will go away/be not
> implementable (in particular, Py3k won't have a byte-oriented
> character string type, so we can't use isprint). In general,
> isprint is unsuitable since it doesn't support multi-byte
> character sets.

Well, iswprint isn't so restricted :-)

I don't see the relevance of this, as EXACTLY the same problem applies
to isalnum and \w.  If you can solve one problem (and you have to solve
the latter), you can solve the other.

> > For Unicode, whatever people agree!  I use the criterion that it
> > has a defined category that doesn't start with 'C' - which is what
> > I think that most people will accept.
>
> -1. There must be a better specification than that.
>
> Can you please explain the concept of "printing character"? If
> you have a Unicode code point, how do you determine whether it
> is printing? If rendering it would generate black pixels on white
> background?

Eh?  This is a character set we are talking about.  The proposed
extensions to include font and colour are an aberration that I shall
thankfully be long retired before they hit.

Unicode has a two letter classification of each character, with the
main category being in upper case and the subsidiary one in lower.
Let's ignore the latter, as it is irrelevant.  The main categories are
'Z' (spaces), 'L' (letters), 'N' (numbers), 'S' (symbols),
'P' (punctuation), 'M' (marks) and 'C' (control characters).  There are
some pretty weird entries in 'L' and 'N', and the difference between
'S', 'P' and 'M' is arcane, to a degree.  But all of the categories
except 'C' are things that display, and 'C' is mainly the ASCII controls
we know and, er, love - with some similar extras.

Obviously, unclassified characters should not be called printing, and
equally obviously controls shouldn't.  There is no clear reason why the
others should not be - especially as the difference between a modifying
accent and a free-standing one is something so obscure that most people
don't even know that there IS one.

The point about an escape for printing characters is to check for bad
characters in text input, and the rule I mentioned is fine for that.
What's the problem with it?

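As a rough illustration of the rule described above (not code from the
original message), the check can be sketched in Python with the standard
unicodedata module.  The names is_printing and find_bad_characters are
made up for this example.  One assumption worth stating: unicodedata
reports unassigned code points as 'Cn', so the single test on 'C' covers
both the controls and the unclassified characters.

```python
import unicodedata


def is_printing(ch):
    """Return True if the main Unicode category of ch is not 'C'.

    Hypothetical helper: 'printing' here means exactly the rule from
    the message, i.e. a defined category that does not start with 'C'.
    """
    # category() returns a two-letter code such as 'Lu' or 'Cc';
    # only the first (upper-case) letter matters for this rule.
    return not unicodedata.category(ch).startswith("C")


def find_bad_characters(text):
    """Return (index, character, category) for each non-printing character."""
    return [
        (i, ch, unicodedata.category(ch))
        for i, ch in enumerate(text)
        if not is_printing(ch)
    ]


if __name__ == "__main__":
    # A NUL byte embedded in otherwise ordinary text input.
    print(find_bad_characters("ab\x00c"))
```

Under this rule a space ('Zs') and a combining accent ('Mn') both count
as printing, while newline and tab ('Cc') do not, so anyone checking
multi-line input would have to allow those separately.
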
Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  [EMAIL PROTECTED]
Tel.:  +44 1223 334761    Fax:  +44 1223 334679
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev