Guido van Rossum wrote:
> On 9/20/06, Michael Chermside <[EMAIL PROTECTED]> wrote:
>> I wrote:
>> >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>> >>> msg[35:-18]
>> u'"\U00010143"'
>> >>> greek_five = msg[36:-19]
>> >>> len(greek_five)
>> 2
>>
>> After posting, I realized that it's worse than that. I suspect that if
>> I tried this on a CPython compiled with wide characters, then
>> len(greek_five) would be 1.
>>
>> What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices.
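
To make the two candidate answers in the quoted question concrete, here is a small sketch. It assumes a current Python 3 interpreter, where str is indexed by code point; the helper names utf16_units and utf32_units are made up for this illustration and are not part of any library. It counts how many storage units the same one-character string needs under each representation:

```python
# Sketch only: helper names are hypothetical, chosen for this illustration.
msg = 'The ancient greeks used the letter "\U00010143" for the number 5.'
greek_five = msg[36:-19]  # the single character U+10143


def utf16_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s):
    """Number of 32-bit code units needed to store s as UTF-32."""
    return len(s.encode("utf-32-le")) // 4


# A UTF-16 ("narrow") representation needs a surrogate pair for U+10143,
# a UTF-32 ("wide") representation needs one unit.
print(utf16_units(greek_five), utf32_units(greek_five))
```

The first number corresponds to the len() reported in the quoted session, the second to what was suspected for a wide-character build.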
while i understand the constraints, i don't think it's a good decision to
leave this implementation-dependent. strings seem to me such basic
functionality that their behaviour should not depend on the platform.

for example, how is an application developer supposed to write
applications then? should they write their own slicing/whatever functions
to get consistent behaviour on linux and windows?

i think this is not just a 'theoretical' issue; it's a very practical one.
the only reason it does not seem important is that characters outside the
16-bit range are not used much at present. (and this situation seems quite
similar to the one when only 8-bit characters were used :-)

btw. an idea:
==============

maybe this 'problem' should be separated into 2 issues:

1. representation of the unicode string (utf-16 or utf-32)
2. behaviour of the unicode strings in python-3000

of course there are some dependencies between them (mostly the performance
of #2).

so why don't we make the *behaviour* cross-platform, and the *performance
characteristics* and the *representation* platform-dependent? (this means
that jython/ironpython could use utf-16, but would slice strings more
slowly because of the surrogate issues.)

================

> Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).
>

i don't see why utf-16 should be the only choice. it's the obvious and
most convenient choice for jython/ironpython, that's correct. but (correct
me if i'm wrong) ironpython or jython could support utf-32; of course that
would mean they could not use the platform's native string type for their
string handling.
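
The idea above (uniform behaviour, platform-dependent representation) can be sketched as follows. This is only an illustration under stated assumptions: the string is modelled as a plain list of UTF-16 code units, and the names to_utf16_units, from_utf16_units, cp_offsets, cp_len and cp_slice are invented for this sketch, not taken from any Python implementation. Length and slicing are defined in code points, at the price of a linear scan for surrogate pairs:

```python
# Sketch only: all function names here are hypothetical.

def to_utf16_units(s):
    """Encode s and return its UTF-16 code units as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def from_utf16_units(units):
    """Decode a list of UTF-16 code units back into a str."""
    return b"".join(u.to_bytes(2, "little") for u in units).decode("utf-16-le")


def cp_offsets(units):
    """Code-unit index where each code point starts, plus the end index."""
    offsets = []
    i = 0
    while i < len(units):
        offsets.append(i)
        is_pair = (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1  # a surrogate pair is one code point
    offsets.append(len(units))
    return offsets


def cp_len(units):
    """Length in code points; O(n) because of the surrogate scan."""
    return len(cp_offsets(units)) - 1


def cp_slice(units, start, stop):
    """Slice by code-point index, never splitting a surrogate pair."""
    offsets = cp_offsets(units)
    start, stop, _ = slice(start, stop).indices(len(offsets) - 1)
    return units[offsets[start]:offsets[stop]]


msg = 'The ancient greeks used the letter "\U00010143" for the number 5.'
units = to_utf16_units(msg)
greek_five = cp_slice(units, 36, -19)
print(len(units), cp_len(units), cp_len(greek_five))
```

With this model the slice from the quoted session has length 1 regardless of the storage width; the cost is that every index operation scans the code units, which is the slower slicing mentioned above.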
but in the same way i could say that, because most of the unix world is
utf-8, the best way for those pythons is to handle strings internally as
utf-8, couldn't i?

it simply seems strange to me to make compromises that make life harder
for cpython users just to make life easier for the jython/ironpython
developers (i mean the 'creators').

gabor
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com