Sure, I could remove all CRLFs in paragraphs that don't end with a </p>
like they should, although I'm happy with just using a file handle instead.
What's MWE?
On 17/05/2022 09:16, Pedro Andres Aranda Gutierrez wrote:
Hi Gilles,
just FYI, you will need to do the cleaning before parsing for
file-like objects, for example when reading from a ZIP file ;-)
In my case, I'm reading from the Internet, normal files and zip files
and the least common denominator is reading blocks
of a given size from the file and feeding the parser with copies of
those blocks with the CR removed...
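The block-wise approach described above could look roughly like the sketch below. The function names and the default block size are illustrative, not taken from the MWE; the parser is assumed to be any object offering feed() and close(), which lxml's HTMLParser does through its feed interface.

```python
import io


def iter_blocks_without_cr(fileobj, block_size=64 * 1024):
    """Yield blocks read from a binary file-like object, with CR bytes removed."""
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        # bytes.replace returns a copy, so the original block is left untouched
        yield block.replace(b"\r", b"")


def feed_without_cr(parser, fileobj, block_size=64 * 1024):
    """Feed CR-free blocks to a parser exposing feed() and close()."""
    for block in iter_blocks_without_cr(fileobj, block_size):
        parser.feed(block)
    return parser.close()


# Small self-check on in-memory data, using a deliberately tiny block size
sample = io.BytesIO(b"<p>line one\r\nline two</p>\r\n")
cleaned = b"".join(iter_blocks_without_cr(sample, block_size=8))
print(cleaned)  # b'<p>line one\nline two</p>\n'
```

Because CR is a single byte, it can never be split across two blocks, so the block size does not affect the result. The same function should work for a member opened with zipfile.ZipFile.open(), a regular file opened in "rb" mode, or an HTTP response object, since all of them provide read(size).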
My MWE above produces the ... at the output...
Best, /PA
On Sun, 15 May 2022 at 11:37, Gilles <codecompl...@free.fr> wrote:
Thanks.
I can live with calling parse() with a file handle instead of a
filename:
===========
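from lxml import etree as et  # assumed import; "et" is not defined in the original snippet

parser = et.HTMLParser(encoding='latin1', remove_blank_text=True, recover=True)

# BAD: passing the filename (INPUT) directly
tree = et.parse(INPUT, parser=parser)

# OK: passing an open file handle instead
with open(INPUT) as tempfile:
    tree = et.parse(tempfile, parser=parser)

root = tree.getroot()
et.dump(root)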
===========
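One plausible reason the file-handle variant behaves differently (an assumption; nothing in this thread confirms it) is that Python's open() in text mode translates "\r\n" to "\n" before the parser ever sees the data. The sketch below only demonstrates that translation with the standard library; the helper name is made up for illustration.

```python
import os
import tempfile


def read_text_mode(raw: bytes) -> str:
    """Write raw bytes to a temporary file and read them back in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # newline=None (the default) enables universal-newline translation
        with open(path, encoding="latin1", newline=None) as f:
            return f.read()
    finally:
        os.remove(path)


text = read_text_mode(b"<p>first line\r\nsecond line</p>\r\n")
print(repr(text))  # '<p>first line\nsecond line</p>\n'
```

Opening the same file with mode "rb" would return the bytes unchanged, CRs included.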
On 15/05/2022 08:30, Pedro Andres Aranda Gutierrez wrote:
Answer to self: use the parser in incremental mode, traverse the read
buffer and chop it into slices delimited by, but not including, the
CRs.
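A minimal sketch of that slicing, assuming the buffer is a bytes object and the parser exposes feed() as lxml's parsers do in incremental mode; the function names are hypothetical:

```python
def slices_between_cr(buffer: bytes):
    """Yield the non-empty slices of buffer that lie between CR bytes."""
    start = 0
    while True:
        end = buffer.find(b"\r", start)
        if end == -1:
            # No further CR: emit whatever is left, if anything
            if start < len(buffer):
                yield buffer[start:]
            return
        if end > start:
            yield buffer[start:end]
        start = end + 1  # skip the CR itself


def feed_slices(parser, buffer: bytes) -> None:
    """Feed each CR-free slice of buffer to the parser."""
    for piece in slices_between_cr(buffer):
        parser.feed(piece)


pieces = list(slices_between_cr(b"ab\r\ncd\r"))
print(pieces)  # [b'ab', b'\ncd']
```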
Best, /PA
On Sat, 14 May 2022 at 08:06, Pedro Andres Aranda Gutierrez
<paag...@gmail.com> wrote:
OK, just for reference, attached is my MWE. Get the ZIP file
from gutenberg.org with
wget https://www.gutenberg.org/files/68047/68047-h.zip
lxml version 4.8, Python 3.9 on Ubuntu 20.04 or macOS Big Sur
Those are really annoying....
Best, /PA
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org