Thanks.

I can live with calling parse() with a file handle instead of a filename:

===========
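from lxml import etree as et  # assumed import; the thread only shows the alias "et"

# INPUT holds the path to the HTML file
parser = et.HTMLParser(encoding='latin1', remove_blank_text=True, recover=True)

# BAD: given the filename, the CRs survive and are dumped as &#13;
tree = et.parse(INPUT, parser=parser)

# OK: a file handle opened in text mode (Python translates CRLF to LF on read)
with open(INPUT) as fh:
    tree = et.parse(fh, parser=parser)

root = tree.getroot()
et.dump(root)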
===========

On 15/05/2022 08:30, Pedro Andres Aranda Gutierrez wrote:
Answer to self: use the parser in incremental mode, traverse the read buffer, and chop it into slices delimited by, but not including, the CRs.
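
A minimal sketch of that incremental approach, under stated assumptions: the helper names cr_free_slices and feed_without_cr and the chunk size are hypothetical, and the parser can be any object exposing feed() and close(), which lxml's HTMLParser does. Note that this drops every CR byte, including lone ones that are not part of a CRLF pair.

```python
import io


def cr_free_slices(stream, chunk_size=64 * 1024):
    """Yield the content of a binary stream as slices containing no CR bytes.

    Each chunk read from the stream is split on CR (0x0D); the CRs themselves
    are discarded and empty slices are skipped.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for piece in chunk.split(b"\r"):
            if piece:
                yield piece


def feed_without_cr(parser, stream, chunk_size=64 * 1024):
    """Feed the CR-free slices to a parser with feed()/close() and return
    whatever close() returns (for lxml, the root element)."""
    for piece in cr_free_slices(stream, chunk_size):
        parser.feed(piece)
    return parser.close()


# Possible use with lxml (requires lxml; INPUT is the path to the HTML file):
#   from lxml import etree as et
#   parser = et.HTMLParser(encoding='latin1', remove_blank_text=True, recover=True)
#   with open(INPUT, 'rb') as fh:
#       root = feed_without_cr(parser, fh)
#   et.dump(root)
```

Reading in binary mode keeps the bytes untouched until the split, so the parser's own encoding setting still applies to what it is fed.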

Best, /PA

On Sat, 14 May 2022 at 08:06, Pedro Andres Aranda Gutierrez <paag...@gmail.com> wrote:

    OK, just for reference, attached is my MWE. Get the ZIP file from
    gutenberg.org <http://gutenberg.org> with

    wget https://www.gutenberg.org/files/68047/68047-h.zip

    lxml version 4.8, Python 3.9 on Ubuntu 20.04 or macOS Big Sur

    Those &#13; are really annoying....

    Best, /PA

    On Fri, 13 May 2022 at 12:47, Gilles <codecompl...@free.fr> wrote:

        On 12/05/2022 22:32, Adrian Bool wrote:
        On 12 May 2022, at 10:26, Gilles <codecompl...@free.fr> wrote:
          File "src\lxml\parser.pxi", line 652, in lxml.etree._raiseParseError
        OSError: Error reading file '<html>

        Look at the last line above - you're giving parse() a string
        containing XML data which the parse() function is treating as
        a filename; trying to open a file with a name equivalent to
        your XML content!

        If you want to parse an XML string - use et.fromstring() instead.

        The StringIO call may be reasonable if your XML didn't exist
        on disk; but if your source data is on disk best to either
        give parse() the filename (but then you get your #13 issue)
        or pass it a file handle provided by open().
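
The distinction described above can be sketched as follows. This is an illustration only: it uses the standard library's xml.etree.ElementTree as a stand-in on the assumption that fromstring() and parse() behave the same way as in lxml for this purpose, and XML_TEXT is made-up sample data.

```python
import io
import xml.etree.ElementTree as et  # stdlib stand-in for lxml.etree (assumption)

XML_TEXT = "<root><item>1</item></root>"  # hypothetical sample data

# A string holding markup goes to fromstring(), which returns the root element.
root_from_string = et.fromstring(XML_TEXT)

# parse() expects a filename or a file-like object, so in-memory text
# has to be wrapped in StringIO first.
tree = et.parse(io.StringIO(XML_TEXT))
root_from_file_object = tree.getroot()

# Handing the markup itself to parse() makes it look for a file of that name.
try:
    et.parse(XML_TEXT)
except OSError as exc:
    parse_error = exc
else:
    parse_error = None
```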

        Sorry I overlooked the last line. I dumbly supposed that
        parse() could take either a file handle or a string.

        _______________________________________________
        lxml - The Python XML Toolkit mailing list -- lxml@python.org
        To unsubscribe send an email to lxml-le...@python.org
        https://mail.python.org/mailman3/lists/lxml.python.org/
        Member address: paag...@gmail.com



-- Questions are not there to be answered,
    questions are there to be asked
    Georg Kreisler

    Headaches with a Juju log:
    unit-basic-16: 09:17:36 WARNING juju.worker.uniter.operation we
    should run a leader-deposed hook here, but we can't yet



