On 2021-01-24 17:04, Chris Angelico wrote:
> On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull
> <turnbull.stephen...@u.tsukuba.ac.jp> wrote:
>> Chris Angelico writes:
>>> Right, but as long as there's only one system encoding, that's not
>>> our problem. If you're on a Greek system and you want to decode
>>> ISO-8859-9 text, you have to state that explicitly. For the
>>> situations where you want heuristics based on byte distributions,
>>> there's always chardet.
>> But that's the big question. If you're just going to fall back to
>> chardet, you might as well start there. No? Consider: if 'open'
>> detects the encoding for you, *you can't find out what it is*.
>> 'open' has no facility to tell you!
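
For what it's worth, the chardet route looks something like this (a
minimal sketch; "data.txt" is just a placeholder name, and chardet is
a third-party package):

    import chardet

    with open("data.txt", "rb") as f:  # read raw bytes first
        raw = f.read()

    guess = chardet.detect(raw)
    # guess is e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw.decode(guess["encoding"] or "ascii")
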
> Isn't that what file objects have attributes for? You can find out,
> for instance, what newlines a file uses, even if it's being
> autodetected.
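
Indeed, text files already expose both of these as attributes today,
so an autodetected encoding could be reported the same way (a quick
sketch; "example.txt" is a placeholder):

    f = open("example.txt")  # text mode
    f.read()
    f.encoding  # whatever open() chose, e.g. 'UTF-8'
    f.newlines  # None, '\n', '\r\n', '\r', or a tuple of those seen
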
>>> In theory, UTF-16 without a BOM can consist entirely of byte values
>>> below 128,
>> It's not just theory, it's my life. 62/80 of the Japanese "hiragana"
>> syllabary encodes in UTF-16 as 2 printing ASCII characters (including
>> SPC). A large fraction of the Han ideographs satisfy that condition,
>> and I wouldn't be surprised if a majority of the 1000 most common
>> ones do. (Not a good bet, because half of the ideographs have a low
>> byte > 127; but the order of characters isn't random, so if a couple
>> of popular radicals each have 50 or so characters grouped in that
>> range, you'd be much of the way there.)
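
Concretely: in UTF-16-BE every kana in that block has high byte 0x30,
which is ASCII "0", and most have a printable low byte (a quick
interpreter check; the exact tally depends on which kana you count):

    >>> "すし".encode("utf-16-be")  # "sushi": U+3059 U+3057
    b'0Y0W'
    >>> sum(0x20 <= (cp & 0xFF) <= 0x7E for cp in range(0x3041, 0x3097))
    62
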
>>> But there's no solution to that,
>> Well, yes, but that's my line. ;-)
> Do you get files that lack the BOM? If so, there's fundamentally no
> way for the autodetection to recognize them. That's why, in my
> quickly-whipped-up algorithm above, I basically had it assume that no
> BOM means not UTF-16. After all, there's no way to know whether it's
> UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
> of having one), so IMO it's not unreasonable to assert that all files
> that don't start with either b"\xFF\xFE" or b"\xFE\xFF" should be
> decoded using the ASCII-compatible detection method.
>
> (Of course, this is *ONLY* if you don't specify an encoding. That
> part won't be going away.)
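
A bare-bones sketch of that rule (names are mine, and this is not the
algorithm quoted earlier in the thread; note that a UTF-32-LE BOM also
starts with b"\xFF\xFE", which this ignores):

    def sniff_encoding(raw):
        # Trust a UTF-16 BOM; Python's "utf-16" codec reads the BOM
        # itself and picks the right byte order.
        if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "utf-16"
        # ... ASCII-compatible distribution checks would go here ...
        return "utf-8"
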
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's
probably UTF-16-BE, and if you see patterns like
b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF-16-LE.

You could also look for, say, sequences of Latin characters and
sequences of Han characters.
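
That check is easy to sketch: count NUL bytes at even vs. odd offsets
(hypothetical helper, thresholds picked arbitrarily):

    def guess_utf16_byte_order(raw):
        # For mostly-Latin text, NULs at even offsets suggest
        # UTF-16-BE and NULs at odd offsets suggest UTF-16-LE.
        even_nuls = raw[0::2].count(0)
        odd_nuls = raw[1::2].count(0)
        if even_nuls > 2 * odd_nuls:
            return "utf-16-be"
        if odd_nuls > 2 * even_nuls:
            return "utf-16-le"
        return None  # no clear signal

e.g. guess_utf16_byte_order(b'H\x00e\x00l\x00l\x00o\x00') returns
'utf-16-le'.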