> -----Original Message-----
> From: Python-Dev [mailto:python-dev-
> [email protected]] On Behalf Of Stefan Ring
> Sent: 9. janúar 2014 09:32
> To: [email protected]
> Subject: Re: [Python-Dev] Python3 "complexity"
>
> > just became harder to use for that purpose.
>
> The entire discussion reminds me very much of the situation with file names
> in OS X. Whenever I want to look at an old zip file or tarball which happens
> to have been lying around on my hard drive for a decade or more, I can't,
> because OS X insists that file names be encoded in UTF-8 and just throws
> errors if that requirement is not met. And I certainly cannot be required
> to re-encode all files to the then-favored encoding continually – although
> favors don't change often, and I'm willing to bet that UTF-8 is here to
> stay, it has already happened twice in my active computer life
> (DOS -> latin-1 -> UTF-8).
Well, yes.
Also, the problem I'm describing has to do with real-world data.
This is the Python 2 program:
with open(fn1) as f1:
    with open(fn2, 'w') as f2:
        f2.write(process_text(f1.read()))
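(By the way, the failure is easy to reproduce. On a UTF-8 locale, the
default text mode of open() decodes exactly like this; the snippet uses an
explicit decode to stay self-contained, and the byte values are just an
illustration of legacy cp1252 data:)

```python
# Bytes that are perfectly valid cp1252, but not a valid UTF-8 sequence.
data = b'caf\xe9'  # 0xE9 is 'é' in cp1252

try:
    # This is what open(fn).read() effectively does on a UTF-8 locale.
    data.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True

assert failed  # the read blows up before you ever see the text
```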
Moving to Python 3, I found that this quickly caused problems. So, I
explicitly added an encoding – better to guess something likely,
e.g. cp1252:
with open(fn1, encoding='cp1252') as f1:
    with open(fn2, 'w', encoding='cp1252') as f2:
        f2.write(process_text(f1.read()))
This mostly worked. But then, with real-world data, we sometimes found that
even files we had declared to be cp1252 contained invalid code points.
Was the file really in cp1252? Or did someone mess up somewhere? Or simply
take a little poetic license with the specification?
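(Concretely: cp1252 itself leaves a handful of byte values undefined –
0x81, 0x8D, 0x8F, 0x90, 0x9D – so a single stray byte is enough to make a
"cp1252" file undecodable:)

```python
# cp1252 has no mapping for byte 0x81, so strict decoding fails.
try:
    b'\x81'.decode('cp1252')
    decodable = True
except UnicodeDecodeError:
    decodable = False

assert not decodable  # one stray byte, and the whole read() raises
```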
This is when it started to become annoying. I mean, clearly something was
broken at some point, or I simply don't know the exact encoding of the file.
But this is not the place to correct that mistake. I want my program to be
robust against such errors. And these errors exist.
So, the third version was:
with open(fn1, 'rb') as f1:
    with open(fn2, 'wb') as f2:
        f2.write(process_bytes(f1.read()))
This works, but now I have a bytes object, which is rather limited in what
it can do. Also, all string constants in my process_bytes() function have
to be b'foo' rather than 'foo'.
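(The friction is real: mixing str and bytes is a hard TypeError in
Python 3, so every literal in the processing code has to grow a b prefix:)

```python
data = b'foo bar'

# str operations against bytes fail outright...
try:
    'foo' in data
    mixed = True
except TypeError:
    mixed = False
assert not mixed

# ...so everything must be spelled as bytes literals instead.
assert b'foo' in data
assert data.replace(b'bar', b'baz') == b'foo baz'
```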
Only much later did I learn about 'surrogateescape'. How is a new user of
Python supposed to know about it? The final version would probably be this:
with open(fn1, encoding='cp1252', errors='surrogateescape') as f1:
    with open(fn2, 'w', encoding='cp1252', errors='surrogateescape') as f2:
        f2.write(process_text(f1.read()))
Will this always work? I don't know. I hope so. But it seems very verbose
when all you want to do is munge some bytes. And the 'surrogateescape'
error handler is not something that a newcomer to the language, or someone
coming from Python 2, is likely to know about.
Could this be made simpler? What if we had an encoding that combines 'ascii'
and 'surrogateescape' – something that allows you to read ASCII text with
unknown high-order bytes without this unneeded verbosity? Something that
would be immediately obvious to a newcomer?
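(For the record, 'surrogateescape' really does round-trip unknown bytes:
undecodable bytes come in as lone surrogates and go back out as the original
bytes, so processing the ASCII parts as text is safe. A sketch, using
'ascii' as the declared encoding:)

```python
raw = b'hello \xff\xfe world'  # arbitrary high-order bytes mixed into ASCII

# Undecodable bytes are smuggled through as lone surrogate code points...
text = raw.decode('ascii', errors='surrogateescape')
assert 'hello' in text and 'world' in text  # ASCII parts usable as str

# ...and re-encoding with the same handler restores the bytes exactly.
assert text.encode('ascii', errors='surrogateescape') == raw
```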
K
_______________________________________________
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev