On 2022-08-21, Chris Angelico <ros...@gmail.com> wrote:
> On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
> <python-list@python.org> wrote:
>> On 2022-08-21, Chris Angelico <ros...@gmail.com> wrote:
>> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
>> > <python-list@python.org> wrote:
>> >> On 2022-08-20, Chris Angelico <ros...@gmail.com> wrote:
>> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram <r...@zedat.fu-berlin.de>
>> >> > wrote:
>> >> >> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >> >> >textual representations. That way, the following two elements are
>> >> >> >the same (and similar with a collection of sub-elements in a
>> >> >> >different order in another document):
>> >> >>
>> >> >> The /elements/ differ. They have the /same/ infoset.
>> >> >
>> >> > That's the bit that's hard to prove.
>> >> >
>> >> >> The OP could edit the files with regexps to create a new version.
>> >> >
>> >> > To you and Jon, who also suggested this: how would that be beneficial?
>> >> > With Beautiful Soup, I have the line number and position within the
>> >> > line where the tag starts; what does a regex give me that I don't have
>> >> > that way?
>> >>
>> >> You mean you could use BeautifulSoup to read the file and identify the
>> >> bits you want to change by line number and offset, and then you could
>> >> use that data to try and update the file, hoping like hell that your
>> >> definition of "line" and "offset" are identical to BeautifulSoup's
>> >> and that you don't mess up later changes when you do earlier ones (you
>> >> could do them in reverse order of line and offset I suppose) and
>> >> probably resorting to regexps anyway in order to find the part of the
>> >> tag you want to change ...
>> >>
>> >> ... or you could avoid all that faff and just do re.sub()?
>> >
>> > Stefan answered in part, but I'll add that it is far FAR easier to do
>> > the analysis with BS4 than regular expressions.
>> > I'm not sure what "hoping like hell" is supposed to mean here, since
>> > the line and offset have been 100% accurate in my experience;
>>
>> Given the string:
>>
>> b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>>
>> what is the line number and offset of the question mark - and does
>> BeautifulSoup agree with your answer? Does the answer to that second
>> question change depending on what parser you tell BeautifulSoup to use?
>
> I'm not sure, because I don't know how to ask BS4 about the location
> of a question mark. But I replaced that with a tag, and:
>
> >>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>"
> >>> from bs4 import BeautifulSoup
> >>> soup = BeautifulSoup(raw, "html.parser")
> >>> soup.body.sourceline
> 4
> >>> soup.body.sourcepos
> 12
> >>> raw.split(b"\n")[3]
> b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>'
> >>> raw.split(b"\n")[3][12:]
> b'<body></body>'
>
> So, yes, it seems to be correct. (Slightly odd in that the sourceline
> is 1-based but the sourcepos is 0-based, but that is indeed the case,
> as confirmed with a much more straight-forward string.)
>
> And yes, it depends on the parser, but I'm using html.parser and it's fine.
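
The position check quoted above can be tried with the standard library alone. This is only a sketch, and it rests on an assumption: that BeautifulSoup's "html.parser" builder takes sourceline and sourcepos from the stdlib HTMLParser.getpos(), which I believe is the case but have not verified against the bs4 source. TagLocator, locate_tags and the sample string are hypothetical names and data made up for illustration. The sketch feeds the parser an already-decoded str, so it illustrates how lines and columns are counted, not how BeautifulSoup picks an encoding.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record (tag, line, column) for each start tag, as reported by getpos()."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a 1-based line number and a 0-based column.
        line, col = self.getpos()
        self.positions.append((tag, line, col))


def locate_tags(text):
    """Return the recorded positions of every start tag in text."""
    parser = TagLocator()
    parser.feed(text)
    parser.close()
    return parser.positions


# Hypothetical sample: a bare "\r" sits between two "\n", and "è" is one
# character in the decoded text but two bytes in UTF-8.
text = "\n\r\nè<body></body>"
(tag, line, col), = locate_tags(text)

line_text = text.split("\n")[line - 1]
char_slice = line_text[col:]                  # column applied to the str
byte_slice = line_text.encode("utf-8")[col:]  # same column applied to bytes

print(tag, line, col)
print(repr(char_slice))
print(repr(byte_slice))
```

Applied to the decoded text, the column lands on the start of the tag. Applied to the UTF-8 bytes of the same line, it lands inside the two-byte "è", so a byte slice only lines up when every character before the tag happens to occupy one byte.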
Hah, yes, it appears html.parser does an end-run around my lovely,
carefully crafted hard case by not even *trying* to work out what
type of line endings the file uses: it is just hard-coded to only
recognise "\n" as a line ending.

With the offset though, BeautifulSoup made an arbitrary decision to
use ISO-8859-1 encoding, and so when you chopped the bytestring at
that offset it only worked because BeautifulSoup had happened to
choose a 1-byte-per-character encoding. Ironically, *without* the
"\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

>> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
>> I am happy with the program throwing an exception" then feel free to
>> remove that substring from the question.)
>
> Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
> be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
> 8859-1. So I would probably just let this one go through as 8859-1.

It looks like BeautifulSoup is doing something like that, yes.
Personally I would be nervous about some of my files being parsed
as UTF-8 and some of them as ISO-8859-1 (due to decoding errors rather
than some of the files actually *being* ISO-8859-1 ;-) )

>> > the only part I'm unsure about is where the _end_ of the tag is (and
>> > maybe there's a way I can use BS4 again to get that??).
>>
>> There doesn't seem to be. More to the point, there doesn't seem to be
>> a way to find out where the *attributes* are, so as I said you'll most
>> likely end up using regexps anyway.
>
> I'm okay with replacing an entire tag that needs to be changed.

Oh, that seems like quite a big change to the original problem.

> Especially if I can replace just the opening tag, not the contents and
> closing tag. And in fact, I may just do that part by scanning for an
> unencoded greater-than, on the assumptions that (a) BS4 will correctly
> encode any greater-thans in attributes,

But your input wasn't created by BeautifulSoup (was it?)
> and (b) if there's a mis-encoded one in the input, the diff will be
> small enough to eyeball, and a human should easily notice that the
> text has been massively expanded and duplicated.

I strongly recommend Stefan Ram's excellent suggestion that, regardless
of how you *make* the change, you can use BeautifulSoup to do a pretty
strong check that the changes effected are both (a) all the ones you
intended and (b) none that you didn't intend.
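
A minimal sketch of that kind of before/after check follows. To keep it free of third-party dependencies it swaps in the standard library's html.parser for BeautifulSoup, so it is not the check Stefan described, only the same idea: parse both versions, compare what the parser saw, and list every difference. EventCollector, parse_events, changed_events and the sample markup are hypothetical names and data for illustration.

```python
import re
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start tags, end tags and text as a flat list of tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Sort by attribute name so attribute order alone is not a difference.
        ordered = tuple(sorted(attrs, key=lambda pair: pair[0]))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Parse markup and return the list of recorded events."""
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


def changed_events(before, after):
    """Return (index, old_event, new_event) for every event that differs."""
    old, new = parse_events(before), parse_events(after)
    if len(old) != len(new):
        raise ValueError("the edit changed the number of parse events")
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


before = '<p><a href="http://example.com/">http://example.com/</a></p>'

# Intended edit: change the scheme in href attributes only.
careful = re.sub(r'href="http:', 'href="https:', before)
# Over-broad edit: also rewrites the visible link text.
sloppy = re.sub(r'http:', 'https:', before)

for label, edited in (("careful", careful), ("sloppy", sloppy)):
    print(label)
    for index, old_event, new_event in changed_events(before, edited):
        print("  ", index, old_event, "->", new_event)
```

The careful substitution shows up as one changed start tag; the over-broad one also changes a text node, which is exactly the "none that you didn't intend" case the check is meant to catch. This collector ignores comments and doctype declarations, so edits inside those would go unnoticed unless the corresponding handlers are added.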