On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote:
> On 5/1/2014 2:04 PM, Rustom Mody wrote:
> >>> Since it's Unicode-troll time, here's my contribution
> >>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
>
> I will not comment on the Unix-assumption part, but I think you go wrong
> with this: "Unicode is a Headache". The major headache is that unicode
> and its very few encodings are not universally used. The headache is all
> the non-unicode legacy encodings still being used. So you had better
> title this section 'Non-Unicode is a Headache'.
>
> The first sentence is this misleading tautology: "With ASCII, data is
> ASCII whether its file, core, terminal, or network; ie "ABC" is
> 65,66,67." Let me translate: "If all text is ASCII encoded, then text
> data is ASCII, whether ..." But it was never the case that all text was
> ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe
> still uses the latter. Other mainframe makers used other encodings of
> A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never
> universal. You could have just as well said "With EBCDIC, data is
> EBCDIC, whether ..."
>
> https://en.wikipedia.org/wiki/Ascii
> https://en.wikipedia.org/wiki/EBCDIC
>
> A crucial step in the spread of Ascii was its use for microcomputers,
> including the IBM PC. The latter was considered a toy by the mainframe
> guys. If they had known that PCs would partly take over the computing
> world, they might have suggested or insisted that it use EBCDIC.
>
> "With unicode there are:
> encodings"
> where 'encodings' is linked to
> https://en.wikipedia.org/wiki/Character_encodings_in_HTML
>
> If html 'always' used utf-8 (like xml), as has become common but not
> universal, all of the problems with *non-unicode* character sets and
> encodings would disappear. The pre-unicode declarations could then
> disappear. More truthful: "without unicode there are 100s of encodings
> and with unicode only 3 that we should worry about."
>
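For readers who want to check the "ABC" example, here is a minimal Python sketch. It assumes that cp037 (one of several EBCDIC code pages shipped with Python) is a fair stand-in for "EBCDIC", and that the three Unicode encodings meant above are UTF-8, UTF-16 and UTF-32; neither assumption is stated in the quoted text.

```python
# Same text, different byte values depending on the encoding in use.
text = "ABC"

ascii_bytes = list(text.encode("ascii"))    # the 65,66,67 from the blog post
ebcdic_bytes = list(text.encode("cp037"))   # cp037: one EBCDIC code page (assumed)
utf8_bytes = list(text.encode("utf-8"))     # UTF-8 agrees with ASCII for A-Z

print("ascii :", ascii_bytes)
print("cp037 :", ebcdic_bytes)
print("utf-8 :", utf8_bytes)

# Outside the ASCII range the three Unicode encodings differ in size.
euro = "\u20ac"  # EURO SIGN
sizes = {
    name: len(euro.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)
```

The "-le" variants are used only so that no byte order mark is added to the counts.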
> "in-memory formats"
>
> These are not the concern of the using programmer as long as they do not
> introduce bugs or limitations (as do all the languages stuck on UCS-2
> and many using UTF-16, including old Python narrow builds). Using what
> should generally be the universal transmission format, UTF-8, as the
> internal format means either losing indexing and slicing, having those
> operations slow from O(1) to O(len(string)), or adding an index table
> that is not part of the unicode standard. Using UTF-32 avoids the above
> but usually wastes space -- up to 75%.
>
> "strange beasties like python's FSR"
>
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is. The string
> header records how many bytes are left. Is the idea of algorithms that
> adapt to inputs really strange to you?
>
> Like good adaptive algorithms, the FSR is invisible to the user except
> for reducing space or time or maybe both. Unicode operations are
> otherwise the same as with previous wide builds. People who used to use
> narrow-builds also benefit from bug elimination. The only 'headaches'
> involved might have been those of the developers who optimized previous
> wide builds.
>
> CPython has many other functions with special-case optimizations and
> 'fast paths' for common, simple cases. For instance, (some? all?) number
> operations are optimized for pairs of integers. Do you call these
> 'strange beasties'?

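The adaptive storage described above can be observed from Python itself. The following is a rough sketch, assuming CPython 3.3 or later; the helper name bytes_per_char is made up for illustration, and sys.getsizeof reports an implementation detail, so other interpreters may give different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character (hypothetical helper).

    Differencing the sizes of two strings built from the same character
    cancels the fixed object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = {
    "$": "ASCII",
    "\u00e9": "Latin-1",
    "\u20ac": "BMP (euro sign)",
    "\U0001F600": "astral",
}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {bytes_per_char(ch)} byte(s) per character")

# Indexing stays by code point whatever the internal width is.
s = "a\u20ac\U0001F600"
print(len(s), hex(ord(s[1])))

# The same three characters as transmission formats.
print(len(s.encode("utf-8")), len(s.encode("utf-32-le")))
```

On CPython with the FSR, the loop should report 1, 1, 2 and 4 bytes per character for the four samples, while len() and indexing behave identically for all of them.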
Here is an instance of someone who would like a certain optimization to
be dis-able-able
https://mail.python.org/pipermail/python-list/2014-February/667169.html
To the best of my knowledge it's nothing to do with unicode or with jmf.

Why, if optimizations are always desirable, do C compilers have
-O0, -O1, -O2, -O3 and zillions of more specific flags?

JFTR I have no issue with FSR. What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him.
[I am one of them]

I don't even know whether jmf has a real technical (as he calls it
'mathematical') issue or it's entirely political:

"Why should I pay more for a EURO sign than a $ sign?"

Well perhaps that is more related to the exchange rate than to python!

--
https://mail.python.org/mailman/listinfo/python-list