On 07Mar2020 15:01, Christopher Barker <python...@gmail.com> wrote:
On Fri, Mar 6, 2020 at 5:54 PM Guido van Rossum <gu...@python.org> wrote:
(Since bytes may be used for file names I think they should get this
new capability too.)
I don’t really care one way or another, but is it really still the
case that bytes need to be used for filenames? For uses other than just passing
them around?
Yes, Linux in particular does not guarantee that file names are using any
particular encoding (let alone a consistent encoding for different files).
The only two bytes that are special are '\0' and '/'.
I *think* I understand the issues. And I can see that some software would
need to work with filenames as arbitrary bytes. But that doesn't mean that
you can do much with them that way.
Given that the entire UNIX filename API is bytes, I think this isn't
very true.
I can see filename.split(b'/') for instance, but how could you strip a
prefix or suffix without knowing the encoding?
Well, directly:
filename.cutsuffix(b'.abc')
But more seriously, you're either treating them as bytes with no
particular encoding and the above just means "remove these 4 bytes" or
you do know the encoding and are working with strings, so you'd either
have a string and cut a string, or have bytes and cut the value
'.abc'.encode(encoding=known_encoding).
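A minimal sketch of those two cases, assuming a hypothetical helper named cut_suffix (the cutsuffix method above is the spelling proposed in this thread, so the sketch does not rely on it existing). The file names and the Latin-1 encoding are illustrative assumptions only.

```python
def cut_suffix(data: bytes, suffix: bytes) -> bytes:
    """Return data without the trailing suffix, or data unchanged if absent."""
    # Hypothetical helper standing in for the proposed bytes method.
    if suffix and data.endswith(suffix):
        return data[:-len(suffix)]
    return data

# Case 1: no encoding assumed; this just removes these 4 bytes.
raw_name = b'report.abc'
raw_stem = cut_suffix(raw_name, b'.abc')

# Case 2: the encoding is known (Latin-1 here, purely as an assumption),
# so the suffix is encoded the same way before cutting.
known_encoding = 'latin-1'
encoded_name = 'r\u00e9sum\u00e9.abc'.encode(known_encoding)
encoded_stem = cut_suffix(encoded_name, '.abc'.encode(known_encoding))
```

In both cases the operation itself is purely byte-wise; the only difference is where the suffix bytes come from.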
Things like listdir are dual mode: call it with a bytes directory name
and you get bytes results, call it with a string directory name and you
get string results. There's some funky encoding accommodation in there
(read the docs; it's a subtlety to do with returning strings
which didn't decode cleanly from the underlying bytes).
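The dual-mode behaviour can be seen with a short sketch using os.listdir and os.fsencode from the standard library. The temporary directory and the file name notes.txt are made up for the illustration, and the name is kept plain ASCII so the sketch does not depend on the filesystem encoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirpath:
    # Create one file with a plain ASCII name so the example is portable.
    with open(os.path.join(dirpath, 'notes.txt'), 'w') as f:
        f.write('example\n')

    # A str directory name gives str results.
    str_names = os.listdir(dirpath)
    # A bytes directory name gives bytes results.
    bytes_names = os.listdir(os.fsencode(dirpath))
```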
filename.strip_suffix(b'.txt') would only work for ASCII-compatible
encodings.
Or b'.txt' is your known bytes encoding of some known string suffix in
your working encoding.
But like the other string-like bytes methods, I think there's a good
case for supporting bytes prefixes and suffixes; it is just a matter of
using the correct bytes affix in the regime you're working in. Might not
be filenames, either.
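To make the point about ASCII-compatible encodings concrete, here is a small sketch comparing how the same suffix encodes. The choice of UTF-8, Latin-1 and UTF-16-LE is my own for illustration; the file name is made up.

```python
suffix = '.txt'

# In ASCII-compatible encodings the suffix is the same 4 bytes.
utf8_suffix = suffix.encode('utf-8')
latin1_suffix = suffix.encode('latin-1')

# In UTF-16-LE each character takes two bytes, so the affix differs.
utf16_suffix = suffix.encode('utf-16-le')

utf16_name = 'notes.txt'.encode('utf-16-le')
# The literal ASCII bytes do not match a UTF-16-LE encoded name...
ascii_suffix_matches = utf16_name.endswith(b'.txt')
# ...but the suffix encoded in the working encoding does.
encoded_suffix_matches = utf16_name.endswith(utf16_suffix)
```

So the bytes method works either way; what changes is which bytes affix is the correct one in your regime.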
There's no way around the fact that you have to make SOME
assumptions about the encoding if you are going to do anything other than
pass it around or work with the b'/' byte.
They needn't be assumptions; all code has some outer context.
And if that's the case, then you
might as well decode and use 'surrogateescape' so the program won't crash.
Ah, I see you've encountered the listdir-return-string stuff already
then.
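For reference, a sketch of the decode-with-'surrogateescape' approach, assuming UTF-8 as the working encoding and a made-up name containing one byte (0xE9) that is not valid UTF-8 in that position:

```python
# A name that is not valid UTF-8: 0xE9 is "e acute" in Latin-1.
raw = b'caf\xe9.txt'

# The undecodable byte becomes a lone surrogate instead of raising.
text = raw.decode('utf-8', errors='surrogateescape')

# Ordinary string methods work on the result.
has_suffix = text.endswith('.txt')

# Encoding with the same error handler restores the original bytes.
round_tripped = text.encode('utf-8', errors='surrogateescape')
```

The string is safe to pass back to the OS, but note that printing it or encoding it with the default strict handler may still raise.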
Getting OT, but I do wonder if we should continue to support (and therefore
encourage) the use of bytes in inappropriate ways.
I think there's plenty of reasonable bytes actions which look a lot like
string actions, and are not confusing. Consider this contrived example:
payload_bytes = packet_bytes.cutprefix(header_bytes)
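Spelled out as something runnable, with a hypothetical cut_prefix helper standing in for the proposed method and an invented 5-byte header (neither comes from a real protocol):

```python
def cut_prefix(data: bytes, prefix: bytes) -> bytes:
    """Return data without the leading prefix, or data unchanged if absent."""
    # Hypothetical helper standing in for the proposed bytes method.
    if data.startswith(prefix):
        return data[len(prefix):]
    return data

# Invented binary header; no text encoding is involved at all.
header_bytes = b'\x7fHDR\x01'
packet_bytes = header_bytes + b'\x00\x10payload'

payload_bytes = cut_prefix(packet_bytes, header_bytes)
```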
There was an interesting writeup by a guy involved in the mercurial
Python 3 port where he discusses the pain which came with the bytes type
lacking a lot of the string support methods when Python 3 first came
out. He suggests a lot of things would have gone far smoother with
these, as Mercurial had a lot of filenames-as-bytes-strings inside. Here
we are:
https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journey-to-and-reflections-on-python-3/
Personally I lean the other way, and welcomed the initial lack of
stringish methods as a good way to uncover bytes mistakenly used for
strings. But I see his point.
Cheers,
Cameron Simpson <c...@cskk.id.au>
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/A7TYUKFN74XOOD5MJGBDG5GMUGNTEFXR/