Hi Gilles
> On 11 May 2022, at 14:03, Gilles <codecompl...@free.fr> wrote:
> Turns out there's no need to use the pathlib module: The issue with " "
> is gone when 1) first reading HTML into a variable 2) before parsing it, even
> with the standard open():
Sure, I just tend to use pathlib for all my file handling, as it's really useful
and has been part of Python's standard library for a good while now, so there is
no extra package to install.
>
> ============
> """ OK
> from pathlib import Path
> with Path(f).open() as tempfile:
>     tree = et.parse(tempfile, parser=parser)
> """
>
> #BAD
> #tree = et.parse(f,parser)
>
> #OK
> with open(f) as reader:
>     content = reader.read()
> #BAD tree=et.fromstring(content)
> tree = et.parse(content, parser)
> ============
>
I'd avoid your second "OK" version though, as that reads the whole of the
original source into memory first; if the file is large this could be
undesirable. You can still use open() (as opposed to using pathlib) to provide
the file handle to the parser, allowing the parser to pull in chunks of the
source file as it proceeds with the parsing:
    with open(source_filename) as source_file_handle:
        tree = et.parse(source_file_handle, parser)
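For anyone who wants to try the pattern in isolation, here is a minimal sketch. It makes a few assumptions that are not from this thread: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (so it runs without lxml installed), and the file name and sample content are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as et  # stand-in for lxml.etree in this sketch

# Hypothetical sample document, written with Windows-style (CRLF) line endings.
sample = b"<root>\r\n<item>first</item>\r\n<item>second</item>\r\n</root>"

with tempfile.TemporaryDirectory() as tmp_dir:
    source_filename = os.path.join(tmp_dir, "sample.xml")
    with open(source_filename, "wb") as f:
        f.write(sample)

    parser = et.XMLParser()
    # Hand the open file to the parser so that it reads the data itself,
    # rather than us building one big string first.
    with open(source_filename) as source_file_handle:
        tree = et.parse(source_file_handle, parser)

root = tree.getroot()
items = [item.text for item in root.iter("item")]
# Check every text and tail for a stray carriage return.
has_cr = any("\r" in (node.text or "") + (node.tail or "") for node in root.iter())
print(items, has_cr)
```

One thing to keep in mind: open() in its default text mode translates "\r\n" and "\r" to "\n" while reading, so in this sketch the parser never sees a carriage return at all. Whether that is what makes the difference with lxml in your case is only a guess on my part.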
I don't know why the simple "et.parse(source_filename, parser)" is not working
for you. I suspect a bug. I came across the following item in the XML spec:
https://www.w3.org/TR/REC-xml/#sec-line-ends
2.11 End-of-Line Handling
XML parsed entities are often stored in computer files which, for editing
convenience, are organized into lines. These lines are typically separated by
some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).
To simplify the tasks of applications, the XML processor must behave as if it
normalized all line breaks in external parsed entities (including the document
entity) on input, before parsing, by translating both the two-character
sequence #xD #xA and any #xD that is not followed by #xA to a single #xA
character.
This is pretty clear in stating that you shouldn't be seeing the #13 (#xD in
hex) end of line characters in your parsed data.
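A small sketch of that rule in action, again assuming the standard library's xml.etree.ElementTree (expat) rather than lxml, and with made-up sample strings. It only shows how one conforming XML parser behaves; it says nothing about what lxml's HTML parser does with the same input.

```python
import xml.etree.ElementTree as et

# Literal CRLF and a lone CR inside character data.
literal_breaks = et.fromstring(b"<p>one\r\ntwo\rthree</p>")

# An explicit character reference to CARRIAGE RETURN.
char_ref = et.fromstring(b"<p>one&#13;two</p>")

print(repr(literal_breaks.text))
print(repr(char_ref.text))
```

The literal line breaks should come back as plain "\n". The normalization in 2.11 applies to the raw input before parsing, so a carriage return written explicitly as a character reference is not covered by it and survives as "\r".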
Cheers
aid