On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton <juliushamilton...@gmail.com> wrote:
>
> Hey,
>
> Could anyone please comment on the purest way simply to strip HTML tags
> from the internal text they surround?
>
> I know Beautiful Soup is a convenient tool, but I’m interested to know what
> the most minimal way to do it would be.

That's definitely the best and most general way, and would still be my
first thought most of the time.

> People say you usually don’t use Regex for a second order language like
> HTML, so I was thinking about using xpath or lxml, which seem like very
> pure, universal tools for the job.
>
> I did find an example for doing this with the re module, though.
>
> Would it be fair to say that to just strip the tags, Regex is fine, but you
> need to build a tree-like object if you want the ability to select which
> nodes to keep and which to discard?

Obligatory reference:
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

> Can xpath / lxml do that?
>
> What are the chief differences between xpath / lxml and Beautiful Soup?
>

I've never directly used lxml, mainly because bs4 offers all the same
advantages and more, with about the same costs. However, if you're
looking for a no-external-deps option, Python *does* include an HTML
parser in the standard library:

https://docs.python.org/3/library/html.parser.html

If your purpose is extremely simple (like "strip tags, search for
text"), then it should be easy enough to whip up something using that
module. No external deps, not a lot of code, pretty straightforward.

On the other hand, if you're trying to do an "HTML to text" conversion,
you'd probably need to be aware of which tags are block-level and which
are inline content, so that (for instance) "<div>Hello</div>
<div>world</div>" would come out as two separate paragraphs of text,
whereas the same thing with <b> tags would become just "Hello world".
But for the most part, handle_data will probably do everything you
need.

ChrisA
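
For reference, here is a minimal sketch of the html.parser approach
described above. It is one possible way to do it, not the only one. The
names TextExtractor, strip_tags and BLOCK_TAGS are made up for
illustration, and the set of block-level tags is a small, incomplete
assumption rather than a full list.

```python
from html.parser import HTMLParser

# Assumption: an illustrative, incomplete subset of block-level tags.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collect text passed to handle_data, dropping the tags themselves."""

    def __init__(self, block_aware=False):
        # convert_charrefs=True turns references like &amp; into "&"
        # before they reach handle_data.
        super().__init__(convert_charrefs=True)
        self.block_aware = block_aware
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Optionally treat block-level tags as line boundaries.
        if self.block_aware and tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if self.block_aware and tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_tags(html, block_aware=False):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor(block_aware=block_aware)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(strip_tags("<b>Hello</b> <b>world</b>"))
    print(strip_tags("<div>Hello</div> <div>world</div>", block_aware=True))
```

With block_aware left off, the <div> example and the <b> example both
come out as plain "Hello world"; turning it on puts the two <div>
contents on separate lines. Note that the contents of <script> and
<style> elements also arrive through handle_data, so a real converter
would probably want to skip those.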