> -----Original Message-----
> From: Python-Dev [mailto:python-dev-
> bounces+kristjan=ccpgames....@python.org] On Behalf Of Stefan Ring
> Sent: 9. janúar 2014 09:32
> To: python-dev@python.org
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> > just became harder to use for that purpose.
> 
> The entire discussion reminds me very much of the situation with file names
> in OS X. Whenever I want to look at an old zip file or tarball which happens
> to have been lying around on my hard drive for a decade or more, I can't,
> because OS X insists that file names be encoded in UTF-8 and just throws
> errors if that requirement is not met. And certainly I cannot be required to
> re-encode all files to the then-favored encoding continually – although the
> favored encoding doesn't change often and I'm willing to bet that UTF-8 is
> here to stay, it has already happened twice in my active computer life
> (DOS -> latin-1 -> UTF-8).

Well, yes.
Also, the problem I'm describing has to do with real-world data.
This is the Python 2 program:
with open(fn1) as f1:
    with open(fn2, 'w') as f2:
        f2.write(process_text(f1.read()))

Moving to Python 3, I found that this quickly caused problems, so I explicitly 
added an encoding.  Better to guess a likely encoding, e.g. cp1252:
with open(fn1, encoding='cp1252') as f1:
    with open(fn2, 'w', encoding='cp1252') as f2:
        f2.write(process_text(f1.read()))
        
This mostly worked.  But then, with real-world data, we found that even files 
we had declared to be cp1252 sometimes contained invalid code points.  Was the 
file really in cp1252?  Did someone mess up somewhere?  Or simply take a bit 
of poetic licence with the specification?
This is when it started to become annoying.  Clearly something was broken at 
some point, or I don't know the exact encoding of the file.  But this is not 
the place to correct that mistake.  I want my program to be robust against 
such errors.  And these errors exist.
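
For illustration, a minimal reproduction of the kind of failure I mean 
(the byte values are made up; 0x81 happens to be one of the handful of 
byte values that cp1252 leaves undefined):

# A file that is "mostly cp1252" but contains one stray undefined byte.
data = b'caf\xe9 \x81 menu'   # 0xE9 is valid cp1252, 0x81 is not

try:
    data.decode('cp1252')
except UnicodeDecodeError as e:
    print(e)  # 'charmap' codec can't decode byte 0x81 in position 5: ...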

So, the third version was:
with open(fn1, 'rb') as f1:
    with open(fn2, 'wb') as f2:
        f2.write(process_bytes(f1.read()))

This works, but now I have a bytes object, which is rather limited in what it 
can do.  Also, all the string constants in my process_bytes() function have to 
be b'foo' rather than 'foo'.
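
To make the b'foo' point concrete, here is a hypothetical process_bytes() 
(the name and the tab expansion it does are my invention, just to show the 
shape such code takes):

def process_bytes(data):
    # Every literal has to be a bytes literal; a plain 'foo' (str) would
    # raise TypeError as soon as it is compared or joined with bytes.
    lines = data.split(b'\n')
    lines = [line.replace(b'\t', b'    ') for line in lines]
    return b'\n'.join(lines)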

Only much later did I learn about 'surrogateescape'.  How is a newcomer to 
Python supposed to know about it?  The final version would probably be this:
with open(fn1, encoding='cp1252', errors='surrogateescape') as f1:
    with open(fn2, 'w', encoding='cp1252', errors='surrogateescape') as f2:
        f2.write(process_text(f1.read()))

Will this always work?  I don't know.  I hope so.  But it seems very verbose 
when all you want to do is munge some bytes.  And the 'surrogateescape' error 
handler is not something that a newcomer to the language, or someone coming 
from Python 2, is likely to know about.
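
For what it's worth, the reason 'surrogateescape' helps is that it round-trips 
the undecodable bytes instead of losing them; a quick sketch (arbitrary byte 
values again):

raw = b'caf\xe9 \x81 menu'      # contains a byte cp1252 cannot decode

text = raw.decode('cp1252', errors='surrogateescape')
# The undecodable 0x81 comes through as the lone surrogate U+DC81.

assert text.encode('cp1252', errors='surrogateescape') == raw
# Encoding with the same error handler restores the original bytes exactly.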

Could this be made simpler?  What if we had an encoding that combines 'ascii' 
and 'surrogateescape'?  Something that lets you read ASCII text with unknown 
high-order bytes without this unneeded verbosity?  Something that would be 
immediately obvious to the newcomer?
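
As far as I can tell, the closest spelling today is the combination itself, 
which is exactly the verbosity I would like to avoid; a sketch of what I mean 
(using the same fn1/fn2/process_text as above):

# Bytes >= 0x80 are smuggled through as surrogates U+DC80..U+DCFF and
# written back out unchanged on encode.
with open(fn1, encoding='ascii', errors='surrogateescape') as f1:
    with open(fn2, 'w', encoding='ascii', errors='surrogateescape') as f2:
        f2.write(process_text(f1.read()))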

K
