On 12/05/2022 08:45, Adrian Bool wrote:
> Sure, I just tend to use pathlib for all my file handling as it's
> really useful and has been part of Python's standard library for a
> good while now, so no extra package to install.
Good to know. I'll use pathlib from now on, as well as avoid reading the
whole file into a variable.
> I don't know why the simple "et.parse(source_filename, parser)" is not
> working for you. I suspect a bug. I came across the following item in
> the XML spec:
Could be. I'm surprised I got no hits on Google when I searched before
asking. I'm surely not the first one to use lxml to parse an HTML file
that has stray CRLFs inside paragraphs, not followed by an HTML line
break or closing tag (<br>, </p>, etc.).
Had another issue with parse(): it treats a plain string argument as a
filename or URL rather than as markup, so wrapping the content in
StringIO() is the way to go:
"""
Traceback (most recent call last):
  File "C:\myscript.py", line 170, in <module>
    tree = et.parse(content, parser)
  File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1859, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1789, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file '<html>
... failed to load external entity "<html>
"""
→ tree = et.parse(StringIO(content), parser)
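
For the archives, here is a minimal sketch of the difference. The sample
markup in "content" is made up for illustration, and I'm assuming the
parser is a plain et.HTMLParser(); adjust both to your own setup.

```python
from io import StringIO

from lxml import etree as et

# Hypothetical sample: a paragraph with a stray CRLF in the middle.
content = "<html><body><p>first line\r\nsecond line</p></body></html>"
parser = et.HTMLParser()  # assumption: default HTML parser options

# Passing the markup itself makes parse() look for a file with that name.
try:
    et.parse(content, parser)
    string_was_parsed = True
except OSError:
    string_was_parsed = False

# Wrapping the markup in a file-like object works.
tree = et.parse(StringIO(content), parser)
root_from_parse = tree.getroot()

# fromstring() accepts markup directly and returns the root element.
root_from_string = et.fromstring(content, parser)
```

As far as I can tell, fromstring() saves the StringIO import when the
content is already in memory, while parse() is the better fit when
reading straight from a file or path.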