Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:

> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
>> Who says there needs to be one. A good engineer will use the
>> definition that is most appropriate to the task at hand. Some things
>> need very solid definitions, and some things don’t.
>
> Then the problem is solved: we have a perfectly good de facto definition
> of character: it is a synonym for "code point", and every single one of
> Marko's objections disappears.

I admit it. Python3 is the perfect medium for your codepoint delivery
needs.

What you don't seem to understand about my objections is that no
programmer needs codepoints per se. Also, Python2's strings do as good a
job at delivering codepoints as Python3's. At the same time, Python2's
strings are a better fit for the Unix system and network programming
model.

>> This goes back to my original point, where I said some people
>> consider UTF-32 as a variable width encoding. For very many things,
>> practically, the ‘codepoint’ isn’t the important thing,
>
> Ah, is this another one of those "let's pick a definition that nobody
> else uses, and state it as a fact" like UTF-32 being variable width?

   Each 32-bit value in UTF-32 represents one Unicode code point and is
   exactly equal to that code point's numerical value.

   <URL: https://en.wikipedia.org/wiki/UTF-32>

That is called a bijection. Even more, it's a homomorphism. A
homomorphism is a very high degree of sameness.

It is essential for people to understand that the very same issues that
plague UTF-8 plague UTF-32 as well. Using UTF in both highlights that
fact.

> If by "very many things", you mean "not very many things", I agree
> with you. In my experience, dealing with code points is "good enough",
> especially if you use Western European alphabets, and even more so if
> you're willing to do a normalization step before processing text.

Of course, UTF-8 doesn't relieve you of Unicode problems. But it has one
big advantage: it can usually deal with non-Unicode data without any
extra considerations, while Python3's strings force you to take
elaborate measures to handle those special cases. Why, even print() must
be guarded against UnicodeEncodeError when the printed string is not
under the programmer's control.

> But of course other people's experience may vary. I'm interested in
> learning about the library you use to process graphemes in your software.

For me, the issue is where do I produce a line break in my text output?
Currently, I'm just counting codepoints to estimate the width of the
output.


Marko
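
For illustration only, here is a rough sketch of the "count codepoints" width estimate mentioned above, next to one possible refinement. The function names are made up for this example and are not from any code discussed in the thread. The refinement relies only on the standard unicodedata module and is a heuristic: it skips combining marks and counts East Asian wide characters as two cells, which is still far from real grapheme segmentation.

```python
import unicodedata


def width_by_codepoints(text):
    """Estimate display width as the plain number of code points."""
    return len(text)


def width_by_cells(text):
    """Heuristic estimate of terminal cells (hypothetical helper).

    Combining marks take no cell of their own; East Asian wide and
    fullwidth characters take two. Everything else counts as one.
    """
    cells = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # drawn on top of the preceding character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cells += 2
        else:
            cells += 1
    return cells


# "cafe" + combining acute, precomposed "café", and three CJK characters
samples = ["cafe\u0301", "caf\u00e9", "\u65e5\u672c\u8a9e"]
for s in samples:
    print(ascii(s), width_by_codepoints(s), width_by_cells(s))
```

The two estimates agree on the precomposed string and disagree on the other two, which is the case where a line break placed by counting codepoints would land in the wrong column.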