On Fri, Dec 21, 2012 at 7:20 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 2012-12-20 19:19, wxjmfa...@gmail.com wrote:
>> The rule is to treat every character of a unique set of characters
>> of a coding scheme in, how to say, an "equal way". The problem can
>> be seen the other way around: every coding scheme has been built to
>> work with a unique set of characters, otherwise it is not working
>> properly!
>>
> It's true that in an ideal world you would treat all codepoints the
> same. However, this is a case where "practicality beats purity".
Actually, no: not all codepoints are the same. Ever heard of Huffman
coding? It's a broad technique used in everything from PK-ZIP/gzip file
compression to Morse code ("here come dots!"). It exploits, and depends
on, a dramatically unequal usage distribution, which all text exhibits
(he will ask "All?", you will answer "All!", and he will understand --
referring to Caesar). A quick sketch of the construction is appended at
the end of this message.

In the case of strings in a Python program, it's fairly obvious that
*many* of them will be ASCII-only; and what's more, most of the long
strings will either be ASCII-only or contain a large number of
non-ASCII characters.

Your microbenchmarks, however, usually look at two highly unusual
cases: either a string with a huge number of ASCII characters and a
single non-ASCII one, or a string made up entirely of the same
non-ASCII character (usually for your replace() tests). I haven't seen
strings like either of those come up in practice. The second snippet at
the end of this message shows the kind of measurement I mean.

Can you show us a performance regression in an *actual* *production*
*program*? And make sure you're comparing against a wide build, here.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
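
Since Huffman coding came up, here is the textbook construction in a
few lines of Python, purely to illustrate "unequal frequencies mean
unequal code lengths". It's a sketch, not how PK-ZIP (or CPython)
actually implements anything; the helper name and the sample sentence
are made up for the demo.

import heapq
from collections import Counter

def huffman_codes(text):
    """Return a {symbol: bitstring} code table for the given text."""
    # Build the tree bottom-up: repeatedly merge the two least frequent
    # subtrees, prepending one more bit to every code inside them.
    heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heapq.heappop(heap)[1:])

sample = "this sentence is made of perfectly ordinary letters"
codes = huffman_codes(sample)
for sym in sorted(codes, key=lambda s: len(codes[s])):
    print(repr(sym), codes[sym])
# Frequent symbols (space, 'e', 't', ...) get short codes, rare ones get
# long codes. The saving only exists because the distribution is skewed.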
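
And here is the flavour of microbenchmark I'm objecting to, together
with what the flexible string representation (PEP 393, CPython 3.3+)
does to the two "unusual" strings. Exact figures will vary with build
and platform; the strings below are ones I made up for illustration,
not taken from your posts.

import sys
import timeit

ascii_only = "x" * 10000
one_wide = "x" * 9999 + "\u20ac"   # same length, one euro sign at the end
all_wide = "\u20ac" * 10000        # the all-non-ASCII case

# PEP 393 stores the first at 1 byte/char and the other two at
# 2 bytes/char, so they should weigh in at roughly double the first.
for s in (ascii_only, one_wide, all_wide):
    print(len(s), sys.getsizeof(s))

# A replace() microbenchmark of the kind I mean: it times copying of an
# artificial string in isolation, not anything a real program spends
# its time doing.
print(timeit.timeit("s.replace('x', 'y')",
                    setup="s = 'x' * 9999 + '\\u20ac'",
                    number=10000))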