On Sunday, 9 March 2014 at 21:38:06 UTC, Nick Sabalausky wrote:
> On 3/9/2014 7:47 AM, w0rp wrote:
>> My knowledge of Unicode pretty much just comes from having to deal with foreign language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.)
>
> Python 2 or 3 (out of curiosity)? If you're including Python3, then that somewhat surprises me as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)
Late reply here. Python 3 is a lot better in terms of Unicode support than 2. The situation in Python 2 was this:
1. The default string type is 'str', an immutable array of bytes.
2. A 'str' could hold text in any of many encodings, including UTF-16, etc.
3. There is a separate 'unicode' type for when you want a true Unicode string.
4. Python implicitly converts between the two, often in wrong ways, causing exceptions to appear where you didn't expect them.
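To make that concrete, here's roughly the kind of thing that used to bite people (Python 2; the variable names are just for illustration):

    # Python 2: mixing 'str' and 'unicode' makes Python implicitly
    # decode the byte string as ASCII, which blows up the moment it
    # contains non-ASCII bytes.
    name = '\xc3\xa9'            # UTF-8 bytes for U+00E9, type 'str'
    greeting = u'Hello ' + name  # implicitly does name.decode('ascii')
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0

The nasty part was that this only failed at runtime, and only for non-ASCII input, so it slipped through testing all the time.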
In 3, this changed to...
1. The default string type is still named 'str', only now it's like the 'unicode' of olde.
2. 'bytes' is a new immutable array-of-bytes type, like the Python 2 'str'.
3. Conversion between 'str' and 'bytes' is always explicit.
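So the same mistake now fails loudly and predictably, regardless of the data. A quick sketch (Python 3):

    # Python 3: 'str' and 'bytes' never mix implicitly.
    data = 'hello'.encode('utf-8')  # str -> bytes, explicit
    text = data.decode('utf-8')     # bytes -> str, explicit
    broken = 'hello' + data         # TypeError, even for pure ASCII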
However, Python 3 works at the code point level (on narrow builds before 3.3 it was actually the UTF-16 code unit level), and you don't see very many algorithms which take, say, combining characters into account. So Python suffers from similar issues.
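For example (Python 3; anything that counts code points behaves this way):

    import unicodedata

    s1 = 'e\u0301'  # 'e' + U+0301 COMBINING ACUTE ACCENT (decomposed)
    s2 = '\u00e9'   # U+00E9, the precomposed form

    print(len(s1), len(s2))  # 2 1 -- one visible character either way
    print(s1 == s2)          # False
    print(unicodedata.normalize('NFC', s1) == s2)  # True

You have to reach for unicodedata.normalize() yourself; none of the built-in string operations do it for you.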