On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
<python-list@python.org> wrote:
>
> On 2022-08-21, Chris Angelico <ros...@gmail.com> wrote:
> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> > <python-list@python.org> wrote:
> >> On 2022-08-20, Chris Angelico <ros...@gmail.com> wrote:
> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
> >> >> 2qdxy4rzwzuui...@potatochowder.com writes:
> >> >> >textual representations. That way, the following two elements are the
> >> >> >same (and similar with a collection of sub-elements in a different
> >> >> >order in another document):
> >> >>
> >> >> The /elements/ differ. They have the /same/ infoset.
> >> >
> >> > That's the bit that's hard to prove.
> >> >
> >> >> The OP could edit the files with regexps to create a new version.
> >> >
> >> > To you and Jon, who also suggested this: how would that be beneficial?
> >> > With Beautiful Soup, I have the line number and position within the
> >> > line where the tag starts; what does a regex give me that I don't have
> >> > that way?
> >>
> >> You mean you could use BeautifulSoup to read the file and identify the
> >> bits you want to change by line number and offset, and then you could
> >> use that data to try and update the file, hoping like hell that your
> >> definition of "line" and "offset" are identical to BeautifulSoup's
> >> and that you don't mess up later changes when you do earlier ones (you
> >> could do them in reverse order of line and offset I suppose) and
> >> probably resorting to regexps anyway in order to find the part of the
> >> tag you want to change ...
> >>
> >> ... or you could avoid all that faff and just do re.sub()?
> >
> > Stefan answered in part, but I'll add that it is far FAR easier to do
> > the analysis with BS4 than regular expressions. I'm not sure what
> > "hoping like hell" is supposed to mean here, since the line and offset
> > have been 100% accurate in my experience;
>
> Given the string:
>
> b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>
> what is the line number and offset of the question mark - and does
> BeautifulSoup agree with your answer? Does the answer to that second
> question change depending on what parser you tell BeautifulSoup to use?

I'm not sure, because I don't know how to ask BS4 about the location
of a question mark. But I replaced that with a tag, and:

>>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>"
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> soup.body.sourceline
4
>>> soup.body.sourcepos
12
>>> raw.split(b"\n")[3]
b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>'
>>> raw.split(b"\n")[3][12:]
b'<body></body>'

So, yes, it seems to be correct. (Slightly odd in that the sourceline
is 1-based but the sourcepos is 0-based, but that is indeed the case,
as confirmed with a much more straight-forward string.)

And yes, it depends on the parser, but I'm using html.parser and it's fine.

> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
> I am happy with the program throwing an exception" then feel free to
> remove that substring from the question.)

Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
8859-1. So I would probably just let this one go through as 8859-1.

> > the only part I'm unsure about is where the _end_ of the tag is (and
> > maybe there's a way I can use BS4 again to get that??).
>
> There doesn't seem to be. More to the point, there doesn't seem to be
> a way to find out where the *attributes* are, so as I said you'll most
> likely end up using regexps anyway.

I'm okay with replacing an entire tag that needs to be changed.
Especially if I can replace just the opening tag, not the contents and
closing tag. And in fact, I may just do that part by scanning for an
unencoded greater-than, on the assumptions that (a) BS4 will correctly
encode any greater-thans in attributes, and (b) if there's a
mis-encoded one in the input, the diff will be small enough to
eyeball, and a human should easily notice that the text has been
massively expanded and duplicated.
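
To make the "scan for an unencoded greater-than" idea concrete, here is a rough sketch of how that might look. The helper names absolute_offset and replace_open_tag are made up for illustration and are not part of BS4; the sketch works on str rather than bytes, and it assumes that lines are delimited by "\n" only (as in the raw.split(b"\n") check above) and that the first ">" after the tag start really does close the opening tag.

```python
def absolute_offset(text, sourceline, sourcepos):
    # Hypothetical helper: turn a 1-based line number and a 0-based
    # column into an index into text.
    # Assumption: only "\n" ends a line, so "\r" and "\v" are ordinary
    # characters that count towards the column.
    lines = text.split("\n")
    # Each preceding line contributes its length plus one for its "\n".
    return sum(len(line) + 1 for line in lines[:sourceline - 1]) + sourcepos


def replace_open_tag(text, sourceline, sourcepos, new_tag):
    # Hypothetical helper: swap the opening tag that starts at the given
    # position for new_tag, leaving contents and closing tag untouched.
    start = absolute_offset(text, sourceline, sourcepos)
    if text[start] != "<":
        raise ValueError("position does not point at the start of a tag")
    # Assumption: any ">" inside an attribute value is encoded as "&gt;",
    # so the first literal ">" ends the opening tag.
    end = text.index(">", start) + 1
    return text[:start] + new_tag + text[end:]


doc = 'intro\n<p>one</p>\n  <a href="x">link</a>\n'
# The <a> tag starts on line 3 (1-based) at column 2 (0-based).
updated = replace_open_tag(doc, 3, 2, '<a href="y">')
print(updated)
```

If several tags in one file need changing, applying the replacements from the last position to the first keeps the earlier line and column numbers valid, as was suggested upthread.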

ChrisA