[lxml] Re: [newbie] lxml adds &#13; before each end of line

Gilles Tue, 17 May 2022 00:46:54 -0700

Sure, I could remove all the CRLFs in paragraphs that don't end with a </p> like they should, although I'm happy with just using a file handle instead.


What's MWE?


On 17/05/2022 09:16, Pedro Andres Aranda Gutierrez wrote:

Hi Gilles,

just FYI, you will need to do the cleaning before parsing for file-like objects, such as when reading from a ZIP file ;-) In my case, I'm reading from the Internet, normal files and ZIP files, and the least common denominator is reading blocks of a given size from the file and feeding the parser with copies of those blocks with the CR removed...
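
A minimal sketch of the block-wise approach described above, assuming the parser is driven through the lxml feed interface (`parser.feed()` / `parser.close()`). The helper names `feed_without_cr` and `feed_zip_member` and the block size are illustrative and not taken from the MWE. The helpers only need an object with a `feed()` method, so the same code serves a ZIP member, a network stream, or a plain file opened in binary mode.

```python
import zipfile


def feed_without_cr(stream, parser, block_size=65536):
    """Read binary blocks from stream and feed them to parser without CRs.

    stream: any object whose read(n) returns bytes (file, ZIP member, ...).
    parser: any object with a feed(bytes) method, e.g. an lxml parser.
    """
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # bytes.replace returns a new object, so the parser gets a CR-free copy.
        cleaned = block.replace(b"\r", b"")
        if cleaned:  # skip blocks that consisted of CRs only
            parser.feed(cleaned)


def feed_zip_member(zip_path, member, parser, block_size=65536):
    """Open one member of a ZIP archive and feed it to parser, CR-free."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as stream:
            feed_without_cr(stream, parser, block_size)
```

With lxml this would be used roughly as `parser = et.HTMLParser(recover=True)`, then `feed_zip_member("68047-h.zip", member, parser)`, then `root = parser.close()`. The name of the member inside the archive is not given in the thread, so `member` is left open here.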

My MWE above produces the ... &#13; at the output...

Best, /PA

On Sun, 15 May 2022 at 11:37, Gilles wrote:

    Thanks.

    I can live with calling parse() with a file handle instead of a
    filename:

    ===========
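    import lxml.etree as et  # assumed import; not shown in the original message

    parser = et.HTMLParser(encoding='latin1', remove_blank_text=True,
                           recover=True)

    # BAD: parsing by filename leaves &#13; before each line end in the output
    tree = et.parse(INPUT, parser=parser)

    # OK: parsing from a text-mode file handle gives no &#13;
    with open(INPUT) as tempfile:
        tree = et.parse(tempfile, parser=parser)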

    root = tree.getroot()
    et.dump(root)
    ===========
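
A likely explanation for why the file-handle variant works (an assumption, not confirmed in the thread): `open()` in text mode uses universal newlines by default, so every CRLF is translated to a bare LF before lxml sees the data, whereas parsing by filename lets the parser read the raw bytes, CR included. The sketch below only demonstrates the `open()` behaviour on a throwaway file. It does not involve lxml, and `read_both_ways` is a hypothetical helper name.

```python
import os
import tempfile


def read_both_ways(raw):
    """Write raw bytes to a temporary file; return (text-mode, binary-mode) reads."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(raw)
        # Text mode with the default newline=None: "\r\n" and "\r" become "\n".
        with open(path, encoding="latin1") as fh:
            as_text = fh.read()
        # Binary mode: bytes come back exactly as stored on disk.
        with open(path, "rb") as fh:
            as_bytes = fh.read()
    finally:
        os.remove(path)
    return as_text, as_bytes


text, data = read_both_ways(b"<p>one\r\ntwo</p>\r\n")
print(repr(text))  # CRs are gone
print(repr(data))  # CRs are still there
```

Passing `newline=''` to `open()` would switch the translation off and keep the CRs in the text that is read.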

    On 15/05/2022 08:30, Pedro Andres Aranda Gutierrez wrote:

    Answer to self: use parser in incremental mode, traverse the read
    buffer and chop it into slices delimited by but not including the
    CR's.
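
One way to read the slicing idea, sketched with the standard library only: split each read buffer at the CR bytes and hand the parser the pieces in between. `cr_free_slices` and `feed_slices` are hypothetical names. `parser` is assumed to be an lxml parser used through its feed interface, but any object with a `feed()` method will do.

```python
def cr_free_slices(buffer):
    """Yield the non-empty slices of a bytes buffer that lie between CR bytes."""
    start = 0
    while start < len(buffer):
        end = buffer.find(b"\r", start)
        if end == -1:
            end = len(buffer)  # no more CRs: the rest is one slice
        if end > start:
            yield buffer[start:end]
        start = end + 1  # step over the CR itself


def feed_slices(parser, buffer):
    """Feed every CR-free slice of buffer to parser, in order."""
    for piece in cr_free_slices(buffer):
        parser.feed(piece)
```

Because each buffer is handled on its own, a CRLF pair split across two reads needs no special care: the CR is dropped from one buffer and the LF arrives with the next.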

    Best, /PA

    On Sat, 14 May 2022 at 08:06, Pedro Andres Aranda Gutierrez wrote:

        OK, just for reference, attached is my MWE. Get the ZIP file
        from gutenberg.org with

        wget https://www.gutenberg.org/files/68047/68047-h.zip

        lxml version 4.8, Python 3.9 on Ubuntu 20.04 or macOS Big Sur

        Those &#13; are really annoying....

        Best, /PA

