Re: [Python-ideas] Fix default encodings on Windows

Stephen J. Turnbull Wed, 17 Aug 2016 02:36:07 -0700

Paul Moore writes:
 > On 16 August 2016 at 16:56, Steve Dower <steve.do...@python.org> wrote:


 > > This discussion is for the developers who insist on using bytes
 > > for paths within Python, and the question is, "how do we best
 > > represent UTF-16 encoded paths in bytes?"

That's incomplete, AFAICS.  (Paul makes this point somewhat
differently.)  We don't want to represent paths in bytes on Windows if
we can avoid it.  Nor does UTF-16 really enter into it (except for the
technical issue of invalid surrogate pairs).  So a full statement is,
"How do we best represent Windows file system paths in bytes for
interoperability with systems that natively represent paths in bytes?"
("Other systems" refers to both other platforms and existing programs
on Windows.)

BTW, why "surrogate pairs"?  Does Windows validate surrogates to
ensure they come in pairs, but not necessarily in the right order (or
perhaps sometimes they resolve to non-characters such as U+1FFFF)?
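
To make the "invalid surrogate" case concrete, here is a minimal sketch in Python. It assumes the problem case is a lone (unpaired) surrogate in a path; it says nothing about what Windows itself validates. The helper name encodes_strictly is made up for illustration.

```python
# Sketch: a well-formed surrogate pair vs. a lone surrogate, seen from str.

def encodes_strictly(text, encoding):
    """Return True if text encodes without a special error handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

paired = "\U0001F600"   # one non-BMP character; a valid pair in UTF-16
lone = "\ud83d"         # a high surrogate with no low surrogate after it

# The pair is ordinary text: two 16-bit units, D83D DE00.
paired_utf16 = paired.encode("utf-16-le")

# The lone surrogate is rejected by the strict codecs ...
lone_ok_utf8 = encodes_strictly(lone, "utf-8")
lone_ok_utf16 = encodes_strictly(lone, "utf-16-le")

# ... but the "surrogatepass" handler lets it through and back again.
lone_utf8 = lone.encode("utf-8", "surrogatepass")
roundtrip = lone_utf8.decode("utf-8", "surrogatepass")
```

So any bytes representation of such a path needs a non-strict error handler (or some other escape scheme) if it is to round-trip.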

Paul says:

 > People passing bytes to open() have in my view, already chosen not
 > to follow the standard advice of "decode incoming data at the
 > boundaries of your application". They may have good reasons for
 > that, but it's perfectly reasonable to expect them to take
 > responsibility for manually tracking the encoding of the resulting
 > bytes values flowing through their code.

Abstractly true, but in practice there's no such need for those who
made the choice!  In a properly set up POSIX locale[1], it Just Works by
design, especially if you use UTF-8 as the preferred encoding.  It's
Windows developers and users who suffer, not those who wrote the code,
nor their primary audience which uses POSIX platforms.
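
A minimal sketch of why it Just Works on POSIX, assuming a UTF-8 locale: undecodable bytes survive a trip through str because of the "surrogateescape" error handler, which is what os.fsdecode() and os.fsencode() use on non-Windows platforms. The codec is spelled out explicitly here so the example does not depend on the local locale; the file name is invented.

```python
# Sketch: bytes -> str -> bytes round-trip with surrogateescape.
UTF8 = "utf-8"

raw_name = b"caf\xe9.txt"   # Latin-1 bytes; not valid UTF-8

# Each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF.
as_str = raw_name.decode(UTF8, "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back = as_str.encode(UTF8, "surrogateescape")
```

The developer who passes bytes through never has to think about the encoding at all, which is exactly why the cost lands elsewhere.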

 > It is of course, also true that "works for me in my environment" is
 > a viable strategy - but the maintenance cost of this strategy if
 > things change (whether in Python, or in the environment) is on the
 > application developers - they are hoping that cost is minimal, but
 > that's a risk they choose to take.

Nick's point is that the risk is on Windows users and developers for
the Windows platform who did *not* make that choice, but rather had it
made for them by developers on a different platform where it Just
Works.  He argues that we should level the playing field.

It's also relevant that those developers on the originating platform
for the code typically resist complexifying changes to make things
work on other platforms too (cf. Victor's advocacy of removing the
bytes APIs on Windows).  Victor's points are good IMO; he's not just
resisting Windows, there are real resource consequences.

 > Code using Unicode is unaffected, certainly. Ideally that means that
 > only a tiny minority of users should be affected. Are we over-reacting
 > to reports of standard practices in Japan? I've no idea.

AFAIK, India and Southeast Asia have already abandoned their
indigenous standards in favor of Unicode/UTF-8, so it doesn't matter
whether they use str or bytes: either way Steve's proposal will Just
Work.  I don't know anything about the situation for Arabic, Hebrew,
Cyrillic, or Eastern European users.  That leaves China, which is like
Japan in having had a practically universal encoding (i.e., every
script you'll actually see roundtrips, emoji being the only practical
issue) since the 1970s.  So I suspect Chinese users also primarily use
their local code page (GB2312 or GB18030) for plain text documents,
possibly including .ini files and Makefiles.
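
To illustrate the risk (a sketch only; the file name is invented): a path read as bytes from a file saved in the local code page stops working once bytes paths are assumed to be UTF-8.

```python
# Sketch: a name stored in a GB18030-encoded config file, then
# interpreted under a UTF-8 assumption.
name = "\u6587\u4ef6.txt"            # a two-character Chinese name + ".txt"

stored = name.encode("gb18030")      # what the .ini file actually contains

# Interpreting the stored bytes with the code page they were written in
# recovers the name.
same_codepage = stored.decode("gb18030")

# Interpreting them as UTF-8 fails outright.
try:
    stored.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

A hard failure is the good case; with other byte sequences a wrong decoder can succeed and silently produce a different name.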

Over-reaction?  I have no idea either.  Just a potentially widespread
risk, both to users and to Python's reputation for maintaining
compatibility.  (I don't think it's "fair", but among my acquaintances
Python has a poor rep -- Steve's argument that if you develop code for
3.5 you should expect to have to modify it to use it with 3.6 cuts no
ice with them.)

 > > If you see an alternative choice to those listed above, feel free
 > > to contribute it. Otherwise, can we focus the discussion on these
 > > (or any new) choices?
 > 
 > Accept that we should have deprecated builtin open and the io module,
 > but didn't do so. Extend the existing deprecation of bytes paths on
 > Windows, to cover *all* APIs, not just the os module, But modify the
 > deprecation to be "use of the Windows CP_ACP code page (via the ...A
 > Win32 APIs) is deprecated and will be replaced with use of UTF-8 as
 > the implied encoding for all bytes paths on Windows starting in Python
 > 3.7". Document and publicise it much more prominently, as it is a
 > breaking change. Then leave it one release for people to prepare for
 > the change.

I like this one!  If my paranoid fears are realized, in practice it
might have to wait two releases, but at least this announcement should
get people who are at risk to speak up.  If they don't, then you can
just call me "Chicken Little" and go ahead!
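
For concreteness, the kind of warning Paul describes might look roughly like the sketch below. The function name and the message wording are mine, purely hypothetical; nothing here is an actual CPython API.

```python
import warnings

def warn_on_bytes_path(path):
    """Hypothetical helper: warn when a path is passed as bytes."""
    if isinstance(path, (bytes, bytearray)):
        warnings.warn(
            "bytes paths currently use the CP_ACP code page on Windows; "
            "it is proposed that a future release treat them as UTF-8",
            DeprecationWarning,
            stacklevel=2,   # attribute the warning to the caller
        )
    return path
```

Note that DeprecationWarning is hidden by default in most contexts, which is one reason the announcement would need to be publicised separately.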


Footnotes: 
[1]  An oxymoron, but there you go.


