[lxml] Re: [newbie] lxml adds &#13; before each end of line

Adrian Bool Wed, 11 May 2022 23:46:36 -0700

Hi Gilles

> On 11 May 2022, at 14:03, Gilles <[email protected]> wrote:
> Turns out there's no need to use the pathlib module: The issue with "&#13;" 
> is gone when 1) first reading HTML into a variable 2) before parsing it, even 
> with the standard open():



Sure, I just tend to use pathlib for all my file handling as it's really useful 
and has been part of Python's standard library for a good while now, so there's 
no extra package to install.

> 
> ============
> """ OK
> from pathlib import Path
> with Path(f).open() as tempfile:
>     tree = et.parse(tempfile, parser=parser)
> """
> 
> #BAD &#13;
> #tree = et.parse(f,parser)
> 
> #OK
> with open(f) as reader:
>     content = reader.read()
> #BAD tree=et.fromstring(content)
> tree  = et.parse(content, parser)
> ============
> 


I'd avoid your second "OK" version though, as that is reading all the original 
source into memory first; if the file is large this could be undesirable.  You 
can still use open() (as opposed to using pathlib) to provide the file handle 
to the parser, allowing the parser to pull in chunks of the source file as it 
proceeds with the parsing:

with open(source_filename) as source_file_handle:
    tree = et.parse(source_file_handle, parser)
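One possible reason the file-handle variants behave differently from passing the bare filename: open() in its default text mode applies Python's universal-newline translation, so anything reading from the handle never sees the carriage returns. Whether that is what is happening in your case is only a guess on my part. The sketch below uses just the standard library and a made-up sample file (no lxml involved) to show the translation itself:

```python
import os
import tempfile

# Hypothetical sample: a small document saved with Windows (CRLF) line endings.
sample = b"<p>first line\r\nsecond line</p>\r\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Default text mode: universal newlines turn "\r\n" into "\n" on read.
    with open(path, encoding="utf-8") as handle:
        translated = handle.read()

    # newline="" disables the translation, so the "\r" characters survive.
    with open(path, encoding="utf-8", newline="") as handle:
        untranslated = handle.read()
finally:
    os.remove(path)

print("\r" in translated)    # False
print("\r" in untranslated)  # True
```

If the source file has CRLF line endings, a handle opened this way hands the parser LF-only text, whereas a filename passed directly leaves the reading of the raw bytes to the parser.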

I don't know why the simple "et.parse(source_filename, parser)" is not working 
for you. I suspect a bug.  I came across the following item in the XML spec:

https://www.w3.org/TR/REC-xml/#sec-line-ends

2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing 
convenience, are organized into lines. These lines are typically separated by 
some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor must behave as if it 
normalized all line breaks in external parsed entities (including the document 
entity) on input, before parsing, by translating both the two-character 
sequence #xD #xA and any #xD that is not followed by #xA to a single #xA 
character.

This is pretty clear in stating that you shouldn't be seeing the &#13; (carriage 
return, #xD in hex) end-of-line characters in your parsed data.
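As a rough illustration of that rule, here is a sketch using the standard library's xml.etree.ElementTree (expat-based) rather than lxml, so it says nothing definite about how lxml, and in particular its HTML parser, treats the same input. The sample documents are invented for the example. Literal carriage returns in the input are normalised away, while a carriage return written as the character reference &#13; is kept:

```python
import xml.etree.ElementTree as ET

# Literal CRLF and a lone CR in the input: both should come out as "\n".
crlf_doc = b"<root>line one\r\nline two\rline three</root>"
root = ET.fromstring(crlf_doc)
print(repr(root.text))  # 'line one\nline two\nline three'

# A carriage return written as a character reference is not a line break
# in the input, so it is not normalised and survives as "\r".
escaped_doc = b"<root>line one&#13;\nline two</root>"
escaped_root = ET.fromstring(escaped_doc)
print(repr(escaped_root.text))  # 'line one\r\nline two'
```

So under XML rules a #xD should only reach the parsed data if the source spelled it out as a character reference.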

Cheers

aid

