On 12/05/2022 08:45, Adrian Bool wrote:
Sure, I just tend to use pathlib for all my file handling as it's really useful and has been part of Python's standard library for a good while now, so no extra package to install.

Good to know. I'll use pathlib from now on, as well as avoid reading the whole file into a variable.
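
For reference, a minimal sketch of that approach, assuming lxml is installed: the file is opened through pathlib and the open handle is passed to et.parse(), so the content never has to sit in a Python string. The helper name parse_html_file and the sample file are made up for illustration.

```python
import tempfile
from pathlib import Path

from lxml import etree as et


def parse_html_file(path: Path):
    """Parse an HTML file without first reading it into a variable."""
    parser = et.HTMLParser()
    # Binary mode leaves the decoding to the parser.
    with path.open("rb") as handle:
        return et.parse(handle, parser)


# Demo with a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    sample.write_text("<html><body><p>Hello</p></body></html>", encoding="utf-8")
    tree = parse_html_file(sample)
    first_paragraph = tree.getroot().findtext(".//p")
```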

I don't know why the simple "et.parse(source_filename, parser)" is not working for you. I suspect a bug.  I came across the following item in the XML spec:

Could be. I'm surprised I got no hits from Google when I searched before asking. I'm surely not the first one to use lxml to parse an HTML file that holds random CRLFs in paragraphs, not followed by an HTML end of line (<br>, </p>, etc.)

I had another issue with parse(): when given a string of HTML, it treats the string as a filename or URL and fails (see the traceback below), so wrapping the content in StringIO() is the way to go:

"""
Traceback (most recent call last):
  File "C:\myscript.py", line 170, in <module>
    tree = et.parse(content, parser)
  File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1859, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1789, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file '<html>
... failed to load external entity "<html>
"""
→   tree = et.parse(StringIO(content), parser)
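
To spell that out: the traceback goes through _parseDocumentFromURL, so the string was being treated as a file name. A minimal sketch of two ways around it, assuming lxml is installed and using a made-up stand-in for the real content:

```python
from io import StringIO

from lxml import etree as et

# Hypothetical stand-in for the HTML already held in a string.
content = "<html><body><p>Some text</p></body></html>"
parser = et.HTMLParser()

# Wrap the string in a file-like object so parse() does not treat it as a path.
tree = et.parse(StringIO(content), parser)
root_from_parse = tree.getroot()

# fromstring() takes the markup directly and returns the root element.
root_from_string = et.fromstring(content, parser)
```

Note that parse() returns an ElementTree while fromstring() returns the root element, so the rest of the script may need a getroot() call depending on which one is used.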
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-le...@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: arch...@mail-archive.com
