On 2022-08-22 19:27:28 -0000, Jon Ribbens via Python-list wrote:
> On 2022-08-22, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> > On 2022-08-22 00:45:56 -0000, Jon Ribbens via Python-list wrote:
> >> With the offset though, BeautifulSoup made an arbitrary decision to
> >> use ISO-8859-1 encoding and so when you chopped the bytestring at
> >> that offset it only worked because BeautifulSoup had happened to
> >> choose a 1-byte-per-character encoding. Ironically, *without* the
> >> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
> >
> > Actually it would. The unit is bytes if you feed it with bytes, and
> > characters if you feed it with str.
>
> No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
> it first chooses an encoding and decodes the bytes before sending that
> output to html.parser, which is what provides the offset. So the offsets
> it gives are in characters, and you've no simple way of converting that
> back to byte offsets.

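To make the character-versus-byte distinction concrete, here is a minimal sketch using only plain Python (no BeautifulSoup). The sample markup, the example.com URL, and the helper name char_to_byte_offset are assumptions made up for illustration. It shows that the same position in a file gets a different offset depending on whether it is counted in bytes or in decoded characters.

```python
# Sample document (an assumption for illustration): "Käse" encoded as UTF-8,
# followed by a link whose position we want to locate.
data = b'<p>K\xc3\xa4se <a href="http://example.com/">link</a></p>'

# Offset counted in bytes, directly on the raw input.
byte_offset = data.index(b"<a ")

# Offsets counted after decoding are in characters, not bytes.
text_utf8 = data.decode("utf-8")
char_offset_utf8 = text_utf8.index("<a ")

# A 1-byte-per-character encoding happens to give the byte offset again.
char_offset_latin1 = data.decode("iso-8859-1").index("<a ")


def char_to_byte_offset(text: str, offset: int, encoding: str) -> int:
    """Map a character offset back to a byte offset.

    Hypothetical helper: only valid if `encoding` is the one that was
    really used and the decoding was lossless (no replaced characters).
    """
    return len(text[:offset].encode(encoding))


print(byte_offset, char_offset_utf8, char_offset_latin1)  # 9 8 9
```

The mapping back only works here because the encoding is known and nothing was replaced during decoding. Once a parser has guessed the encoding for you, that is exactly the information you may not have.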
Ah, I see. It "worked" for me because "\xed\xa0\x80\xed\xbc\x9f" isn't
valid UTF-8. So BeautifulSoup decided to ignore the "<meta
charset='utf-8'>" I had inserted before and used ISO-8859-1, providing
me with correct byte offsets. If I replace that gibberish with a correct
UTF-8 sequence (e.g. "\x4B\xC3\xA4\x73\x65") the UTF-8 is decoded and I
get a character offset.

> >> It looks like BeautifulSoup is doing something like that, yes.
> >> Personally I would be nervous about some of my files being parsed
> >> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> >> than some of the files actually *being* ISO-8859-1 ;-) )
> >
> > Since none of the syntactically meaningful characters have a code >=
> > 0x80, you can parse HTML at the byte level if you know that it's encoded
> > in a strict superset of ASCII (which all of the ISO-8859 family and
> > UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16
> > (or Shift-JIS or EUC, if I remember correctly)) then you have to know
> > the character set.
> >
> > (By parsing I mean only "create a syntax tree". Obviously you have to
> > know the encoding to know whether to display «c3 bc» as «ü» or
> > «Ã¼».)
>
> But the job here isn't to create a syntax tree. It's to change some of
> the content, which for all we know is not ASCII.

We know it's URLs, and the canonical form of a URL is ASCII. The URLs
in the files may not be, but if they aren't you'll have to deal with
variants anyway. And the start and end of the attribute can be
determined in any strict superset of ASCII including UTF-8.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
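
As a rough sketch of the byte-level approach argued for above: the attribute boundaries can be found on the raw bytes without decoding, provided the file is in an encoding such as UTF-8 or ISO-8859-x. The regular expression, the function names, and the example.com URLs are all assumptions for illustration. The pattern only handles double-quoted href values and is not a real HTML parser.

```python
import re

# Hypothetical pattern: every syntactically relevant byte here is ASCII,
# so it can be matched without knowing which ASCII superset is in use.
HREF_RE = re.compile(rb'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def href_spans(data: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of each double-quoted href value."""
    return [m.span(1) for m in HREF_RE.finditer(data)]


def rewrite_hrefs(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace href values equal to `old` with `new`, leaving all other bytes alone."""
    def fix(m: re.Match) -> bytes:
        if m.group(1) != old:
            return m.group(0)
        # Rebuild the match, swapping only the attribute value.
        start, end = m.start(1) - m.start(), m.end(1) - m.start()
        return m.group(0)[:start] + new + m.group(0)[end:]
    return HREF_RE.sub(fix, data)


markup = 'Käse <a href="http://example.com/a">x</a>'
utf8_doc = markup.encode("utf-8")
latin1_doc = markup.encode("iso-8859-1")

# The byte offsets differ between the two files, but each span selects the URL.
for doc in (utf8_doc, latin1_doc):
    (start, end), = href_spans(doc)
    print(start, end, doc[start:end])

new_utf8 = rewrite_hrefs(utf8_doc, b"http://example.com/a", b"https://example.com/b")
```

The non-ASCII text around the link is never decoded, so it passes through untouched whichever of the two encodings the file really uses.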
-- https://mail.python.org/mailman/listinfo/python-list