Hi Gilles,

just FYI: you will need to do the cleaning before parsing for file-like objects, for example when reading from a ZIP file ;-) In my case, I'm reading from the Internet, from normal files and from ZIP files, and the least common denominator is reading blocks of a given size from the file and feeding the parser copies of those blocks with the CRs removed. My MWE above produces the ... at the output...
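For reference, here is a minimal sketch of that block-wise approach (not the attached MWE itself). The function name parse_without_cr and the block size are made up for illustration. The demo uses the standard library's xml.etree.ElementTree.XMLParser as a stand-in so that it runs without extra packages; lxml's parsers also have feed() and close(), so an et.HTMLParser(...) instance should be usable in its place.

```python
import io
import zipfile
import xml.etree.ElementTree as ET  # stdlib stand-in for the lxml parser

BLOCK_SIZE = 4096  # hypothetical block size, pick whatever suits the source


def parse_without_cr(fileobj, parser, block_size=BLOCK_SIZE):
    """Feed a binary file-like object to a feed()/close() style parser,
    block by block, dropping every CR byte before feeding."""
    while True:
        block = fileobj.read(block_size)
        if not block:  # empty bytes object means end of file
            break
        cleaned = block.replace(b"\r", b"")  # copy of the block without CRs
        if cleaned:  # skip blocks that consisted only of CRs
            parser.feed(cleaned)
    return parser.close()


# Demo: build a small ZIP archive in memory whose member uses CRLF line ends.
payload = b"<html>\r\n<body>\r\n<p>Hello</p>\r\n</body>\r\n</html>\r\n"
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("page.html", payload)

archive.seek(0)
with zipfile.ZipFile(archive) as zf:
    with zf.open("page.html") as member:  # binary file-like object
        # A tiny block size on purpose, so tags get split across blocks.
        root = parse_without_cr(member, ET.XMLParser(), block_size=8)

print(root.tag, root.find("body/p").text)
```

Because every CR byte is dropped, rather than only CRLF pairs, it does not matter where a block boundary falls.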

Best, /PA

On Sun, 15 May 2022 at 11:37, Gilles <codecompl...@free.fr> wrote:

> Thanks.
>
> I can live with calling parse() with a file handle instead of a filename:
>
> ===========
> parser = et.HTMLParser(encoding='latin1', remove_blank_text=True, recover=True)
>
> # BAD
> tree = et.parse(INPUT, parser=parser)
>
> # OK
> with open(INPUT) as tempfile:
>     tree = et.parse(tempfile, parser=parser)
>
> root = tree.getroot()
> et.dump(root)
> ===========
>
> On 15/05/2022 08:30, Pedro Andres Aranda Gutierrez wrote:
>
> Answer to self: use parser in incremental mode, traverse the read buffer
> and chop it into slices delimited by but not including the CR's.
>
> Best, /PA
>
> On Sat, 14 May 2022 at 08:06, Pedro Andres Aranda Gutierrez <
> paag...@gmail.com> wrote:
>
>> OK, just for reference, attached is my MWE. Get the ZIP file from
>> gutenberg.org with
>>
>> wget https://www.gutenberg.org/files/68047/68047-h.zip
>>
>> lxml version 4.8, python 3.9 on Ubuntu 20.04 or macOS BigSur
>>
>> Those are really annoying....
>>
>> Best, /PA
>>
>> On Fri, 13 May 2022 at 12:47, Gilles <codecompl...@free.fr> wrote:
>>
>>> On 12/05/2022 22:32, Adrian Bool wrote:
>>>
>>> On 12 May 2022, at 10:26, Gilles <codecompl...@free.fr> wrote:
>>>
>>> File "src\lxml\parser.pxi", line 652, in lxml.etree._raiseParseError
>>> OSError: Error reading file* '<html>*
>>>
>>> Look at the last line above - you're giving parse() a string containing
>>> XML data which the parse() function is treating as a filename; trying to
>>> open a file with a name equivalent to your XML content!
>>>
>>> If you want to parse an XML string - use et.fromstring() instead.
>>>
>>> The StringIO call may be reasonable if your XML didn't exist on disk;
>>> but if your source data is on disk best to either give parse() the filename
>>> (but then you get your #13 issue) or pass it a file handle provided by
>>> open().
>>>
>>> Sorry I overlooked the last line. I dumbly supposed that parse() could
>>> take either a file handle or a string.

--
Questions are not there to be answered,
questions are there to be asked
Georg Kreisler

Headaches with a Juju log:
unit-basic-16: 09:17:36 WARNING juju.worker.uniter.operation we should run
a leader-deposed hook here, but we can't yet
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-le...@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: arch...@mail-archive.com