[lxml] Re: [newbie] lxml adds before each end of line

Gilles Thu, 12 May 2022 02:27:31 -0700

On 12/05/2022 08:45, Adrian Bool wrote:

Sure, I just tend to use pathlib for all my file handling as it's really useful and has been part of Python's standard library for a good while now, so no extra package to install.

Good to know. I'll use pathlib from now on, as well as avoid reading the whole file into a variable.
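A minimal sketch of that approach, handing an open file object to the parser rather than reading the file into a string first. It uses the standard library's xml.etree.ElementTree in place of lxml so it runs without extra packages (lxml's et.parse() also accepts an open file object); the file name sample.xml and its contents are made up for illustration:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample file, created in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sample.xml"
    source.write_text("<html><body><p>Hello</p></body></html>", encoding="utf-8")

    # Pass the open file object to the parser instead of reading
    # the whole file into a variable first.
    with source.open("rb") as handle:
        tree = ET.parse(handle)

root_tag = tree.getroot().tag
paragraph_text = tree.getroot().findtext("body/p")
```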

I don't know why the simple "et.parse(source_filename, parser)" is not working for you. I suspect a bug. I came across the following item in the XML spec:

Could be. I'm surprised I got no hits from Google when I searched before asking. I'm surely not the first one to use lxml to parse an HTML file that holds random CRLFs in paragraphs that are not followed by an HTML end of line (<br>, </p>, etc.).

Had another issue with parse(): it doesn't like reading from a string, because it treats a string argument as a file name or URL, so StringIO() is the way to go:


"""
Traceback (most recent call last):
  File "C:\myscript.py", line 170, in <module>
    tree = et.parse(content, parser)
  File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1859, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1789, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file '<html>
... failed to load external entity "<html>
"""
→   tree = et.parse(StringIO(content), parser)
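The difference can be reproduced with a small sketch. This one assumes a short, well-formed sample string and swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra packages; its parse() has the same contract of taking a file name or a file-like object, not markup:

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Hypothetical sample markup standing in for the real file contents.
content = "<html><body><p>Hello</p></body></html>"

# A plain string is treated as a file name, so this raises an OSError
# (FileNotFoundError on most systems) instead of parsing the markup.
try:
    ET.parse(content)
    string_failed = False
except OSError:
    string_failed = True

# Wrapping the string in StringIO gives parse() a file-like object.
tree = ET.parse(StringIO(content))

# fromstring() parses a string directly and returns the root element.
root = ET.fromstring(content)
```

With lxml, the corresponding calls would be et.parse(StringIO(content), parser) as above, or et.fromstring(content, parser), which takes the string directly and returns the root element rather than a tree.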

