> On 19 Aug 2022, at 22:04, Chris Angelico <ros...@gmail.com> wrote:
>
> On Sat, 20 Aug 2022 at 05:12, Barry <ba...@barrys-emacs.org> wrote:
>>
>>>> On 19 Aug 2022, at 19:33, Chris Angelico <ros...@gmail.com> wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>>
>> I recall that in bs4 it parses into an object tree and loses the detail of
>> the input.
>> I recently ported from very old bs to bs4 and hit the same issue.
>> So no it will not output the same as went in.
>>
>> If you can trust the input to be parsed as xml, meaning all the rules of
>> closing tags have been followed. Then I think you can parse and unparse
>> thru xml to do what you want.
>>
>
> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> well. Thanks for trying, anyhow.
>
> So I'm left with a few options:
>
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed

Can you build a beta site with the original intact?

I also wonder whether using Selenium to walk the site might work as a
verification step. I cannot recall if you can get an image of the browser
window to do image compares with, to look for rendering differences.

From my one task using bs4 I did not see it produce any bad results.
In my case the problems were in the code that was built on bs1 using bad
assumptions.

> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
>
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
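
For what it's worth, options 3 and 4 in the quoted list might look roughly
like the sketch below. It is only an illustration under some assumptions: it
uses the standard library's html.parser instead of bs4, so HTMLParser.getpos()
stands in for elem.sourceline / elem.sourcepos, and get_starttag_text() saves
finding the end of the tag by hand. LinkFinder and replace_url are made-up
names, and only <a href="..."> tags are handled. Everything outside the
matching start tags is left byte-for-byte as it was.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Hypothetical helper: record where each <a href=...> start tag sits."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hits = []  # (line number, column, raw start tag text, href value)

    def handle_starttag(self, tag, attrs):
        if tag != "a":  # tag names arrive lower-cased
            return
        href = dict(attrs).get("href")
        if href is not None:
            lineno, col = self.getpos()  # 1-based line, 0-based column of '<'
            self.hits.append((lineno, col, self.get_starttag_text(), href))


def replace_url(html, old_url, new_url):
    """Swap old_url for new_url inside matching <a> start tags only.

    Assumption: the URL appears in the source exactly as given (no
    entities such as &amp; inside it), otherwise the tag is left alone.
    """
    finder = LinkFinder()
    finder.feed(html)
    finder.close()

    # Offset of the first character of every line; the parser counts
    # lines by "\n" only, so do the same here.
    line_starts = [0]
    for i, ch in enumerate(html):
        if ch == "\n":
            line_starts.append(i + 1)

    # Work from the end of the file backwards so that earlier offsets
    # stay valid even when the replacement changes the length.
    for lineno, col, raw, href in reversed(finder.hits):
        if href != old_url:
            continue
        start = line_starts[lineno - 1] + col
        end = start + len(raw)
        html = html[:start] + raw.replace(old_url, new_url) + html[end:]
    return html
```

Called as replace_url(text, "http://old.example/x", "https://new.example/x"),
this should leave quirks such as upper-case tag names, odd quoting and
unclosed tags untouched, which also makes a plain diff of before and after
usable as a verification step.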