Just to avoid confusion, let me state up front that I am very well aware of 
encodings and all that, having internationalized one largish app in python 2.x. 
 I know the problems that 2.x had with tracking down the source of errors and 
understand the beautiful concept of encodings on the boundary.

However:
For a lot of data processing and tools, encoding isn't an issue.  Either you 
assume ASCII, or you're working with something like Latin-1, a single-byte 
encoding.  This is because you're working with a text file that _you_ wrote, 
and you're not assigning any semantics to the characters.  If there is actual 
"text" in there, it is just English, not Norwegian or Turkish.  A byte with 
the value 0xfa doesn't mean anything special; it's just that, a byte with 
that value.  The file system doesn't have any default encoding; a file on 
disk is just a file on disk, consisting of bytes.  There can never be any 
wrong encoding, no mojibake.

With Python 2, you can read that file into a string object.  You can scan for 
your field delimiter, e.g. a comma, split up your string, interpolate some 
binary data, and spit it out again, all without ever thinking about encodings.
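
Something like this, say (a minimal sketch; the file name and the field 
layout are made up):

    # Python 2: text mode gives a plain byte string (str)
    with open("data.txt") as f:
        data = f.read()
    fields = data.split(",")       # scan for the delimiter, split it up
    fields[0] += "\xfa\x00\x01"    # interpolate some binary data
    with open("data.txt", "w") as f:
        f.write(",".join(fields))  # spit it out again, byte for byte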

Even though the file is conceptually encoded in something, if you insist on 
attaching a particular semantic meaning to every ordinal value, that meaning 
is in many cases irrelevant to the program.

I understand that surrogateescape allows you to do this.  But it is an 
awkward extra step and forces an extra layer of needless semantics onto 
someone who just wants to read a file.  Sure, vegetarians and people with 
allergies like to read the list of ingredients on everything that they eat.  
But others are just omnivores and want to be able to eat whatever is on the 
table, and not worry about what it is made of.
And yes, you can read the file in binary mode, but then you end up with those 
bytes objects that, as we have just established, are tedious to work with.
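
For completeness, here is how I understand the Python 3 equivalents would 
look (again just a sketch, with the same made-up file name):

    # Option 1: surrogateescape.  Undecodable bytes become lone surrogates
    # on read and round-trip back to the original bytes on write.
    with open("data.txt", errors="surrogateescape") as f:
        data = f.read()
    fields = data.split(",")
    with open("data.txt", "w", errors="surrogateescape") as f:
        f.write(",".join(fields))

    # Option 2: binary mode.  Works, but every literal now needs a b prefix.
    with open("data.txt", "rb") as f:
        fields = f.read().split(b",")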

So, what I'm saying is that I, at least, have a very common use case that has 
just become a) more confusing (having to needlessly derail the train of 
thought about the data processing at hand in order to think about text 
encodings) and b) more complicated.
Not sure if there is anything to be done about it, though :)

I think there might be a different analogy:  Having to specify an encoding is 
like having strong typing.  In Python 2.7, we _can_ forgo that and just 
duck-type our strings :)

K
________________________________________
From: Python-Dev [python-dev-bounces+kristjan=ccpgames....@python.org] on 
behalf of R. David Murray [rdmur...@bitdance.com]
Sent: Wednesday, January 08, 2014 23:40
To: python-dev@python.org
Subject: Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)


Why *do* you care?  Isn't your system configured for utf-8, and all your
.txt files encoded with utf-8 by default?  Or at least configured
with a single consistent encoding?  If that's the case, Python3
doesn't make you think about the encoding.  Knowing the right encoding
is different from needing to know the difference between text and bytes;
you only need to worry about encodings when your system isn't configured
consistently to begin with.

If you do have to care, your little utilities only work by accident in 
Python 2, and must have produced mojibake when the encoding was wrong, 
unless I'm completely confused.  So yeah, sorting that out is harder if 
you were just living with the mojibake before... but if so, I'm surprised 
you haven't wanted to fix that before this.