On 2021-01-24 17:04, Chris Angelico wrote:
> On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull
> <turnbull.stephen...@u.tsukuba.ac.jp> wrote:
>> Chris Angelico writes:
>>> Right, but as long as there's only one system encoding, that's not
>>> our problem. If you're on a Greek system and you want to decode
>>> ISO-8859-9 text, you have to state that explicitly. For the
>>> situations where you want heuristics based on byte distributions,
>>> there's always chardet.
>> But that's the big question. If you're just going to fall back to
>> chardet, you might as well start there. No? Consider: if 'open'
>> detects the encoding for you, *you can't find out what it is*.
>> 'open' has no facility to tell you!
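
For what it's worth, the chardet route looks something like this (a
minimal sketch; "data.txt" is just a placeholder name, and chardet is
a third-party package):

    import chardet

    with open("data.txt", "rb") as f:  # read raw bytes first
        raw = f.read()

    guess = chardet.detect(raw)
    # guess is e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw.decode(guess["encoding"] or "ascii")
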
> Isn't that what file objects have attributes for? You can find out,
> for instance, what newlines a file uses, even if it's being
> autodetected.
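
Indeed, text files already expose both of these as attributes today,
so an autodetected encoding could be reported the same way (a quick
sketch; "example.txt" is a placeholder):

    f = open("example.txt")  # text mode
    f.read()
    f.encoding  # whatever open() chose, e.g. 'UTF-8'
    f.newlines  # None, '\n', '\r\n', '\r', or a tuple of those seen
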
>>> In theory, UTF-16 without a BOM can consist entirely of byte values
>>> below 128,
>> It's not just theory, it's my life. 62/80 of the Japanese "hiragana"
>> syllabary encodes in UTF-16 as 2 printing ASCII characters (including
>> SPC). A large fraction of the Han ideographs satisfy that condition,
>> and I wouldn't be surprised if a majority of the 1000 most common
>> ones do. (Not a good bet, because half of the ideographs have a low
>> byte > 127; but the order of characters isn't random, so if a couple
>> of popular radicals each have 50 or so characters grouped in that
>> range, you'd be much of the way there.)
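
Concretely: in UTF-16-BE every kana in that block has high byte 0x30,
which is ASCII "0", and most have a printable low byte (a quick
interpreter check; the exact tally depends on which kana you count):

    >>> "すし".encode("utf-16-be")  # "sushi": U+3059 U+3057
    b'0Y0W'
    >>> sum(0x20 <= (cp & 0xFF) <= 0x7E for cp in range(0x3041, 0x3097))
    62
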
>>> But there's no solution to that,
>> Well, yes, but that's my line. ;-)
> Do you get files that lack the BOM? If so, there's fundamentally no
> way for the autodetection to recognize them. That's why, in my
> quickly-whipped-up algorithm above, I basically had it assume that no
> BOM means not UTF-16. After all, there's no way to know whether it's
> UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
> of having one), so IMO it's not unreasonable to assert that all files
> that don't start with either b"\xFF\xFE" or b"\xFE\xFF" should be
> decoded using the ASCII-compatible detection method.
>
> (Of course, this is *ONLY* if you don't specify an encoding. That
> part won't be going away.)
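
A bare-bones sketch of that rule (names are mine, and this is not the
algorithm quoted earlier in the thread; note that a UTF-32-LE BOM also
starts with b"\xFF\xFE", which this ignores):

    def sniff_encoding(raw):
        # Trust a UTF-16 BOM; Python's "utf-16" codec reads the BOM
        # itself and picks the right byte order.
        if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "utf-16"
        # ... ASCII-compatible distribution checks would go here ...
        return "utf-8"
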
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's
probably UTF-16-BE, and if you see patterns like
b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF-16-LE.

You could also look for, say, sequences of Latin characters and
sequences of Han characters.
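
That check is easy to sketch: count NUL bytes at even vs. odd offsets
(hypothetical helper, thresholds picked arbitrarily):

    def guess_utf16_byte_order(raw):
        # For mostly-Latin text, NULs at even offsets suggest
        # UTF-16-BE and NULs at odd offsets suggest UTF-16-LE.
        even_nuls = raw[0::2].count(0)
        odd_nuls = raw[1::2].count(0)
        if even_nuls > 2 * odd_nuls:
            return "utf-16-be"
        if odd_nuls > 2 * even_nuls:
            return "utf-16-le"
        return None  # no clear signal

e.g. guess_utf16_byte_order(b'H\x00e\x00l\x00l\x00o\x00') returns
'utf-16-le'.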