Hi Gilles,

just FYI, you will need to do the cleaning before parsing for file-like
objects, e.g. when reading from a ZIP file ;-)
In my case, I'm reading from the Internet, normal files and ZIP files, and
the least common denominator is reading blocks of a given size from the
file and feeding the parser with copies of those blocks with the CRs
removed...
My MWE above produces the ... &#13; at the output...
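
A minimal sketch of that block-wise cleaning, not the MWE itself. The
names blocks_without_cr and feed_without_cr and the default block size
are made up for illustration. The parser argument only needs feed() and
close(), which lxml's et.HTMLParser provides. It assumes a binary
file-like object, CRLF line endings, and an encoding such as latin1 or
UTF-8 in which byte 0x0D can only ever be a CR:

```python
def blocks_without_cr(fileobj, block_size=64 * 1024):
    """Yield blocks read from a binary file-like object with CRs removed."""
    while True:
        block = fileobj.read(block_size)
        if not block:
            # An empty read means end of file.
            break
        # bytes.replace() returns a new object, so the parser is fed a
        # cleaned copy and the block that was read stays untouched.
        cleaned = block.replace(b"\r", b"")
        if cleaned:
            # Skip blocks that consisted of CRs only.
            yield cleaned


def feed_without_cr(fileobj, parser, block_size=64 * 1024):
    """Feed CR-free blocks to any parser exposing feed() and close()."""
    for block in blocks_without_cr(fileobj, block_size):
        parser.feed(block)
    # For an lxml feed parser, close() returns the root element.
    return parser.close()
```

With lxml this would be called as
feed_without_cr(fh, et.HTMLParser(recover=True)), where fh comes from
open(path, 'rb'), zipfile.ZipFile.open() or urllib.request.urlopen().
Since CR is a single byte, a CRLF pair split across two blocks needs no
special handling.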

Best, /PA

On Sun, 15 May 2022 at 11:37, Gilles <codecompl...@free.fr> wrote:

> Thanks.
>
> I can live with calling parse() with a file handle instead of a filename:
>
> ===========
> parser = et.HTMLParser(encoding='latin1', remove_blank_text=True, recover=True)
>
> # BAD: passing the filename leaves &#13; in the output
> tree = et.parse(INPUT, parser=parser)
>
> # OK: passing an open file handle does not
> with open(INPUT) as tempfile:
>     tree = et.parse(tempfile, parser=parser)
>
> root = tree.getroot()
> et.dump(root)
> ===========
>
> On 15/05/2022 08:30, Pedro Andres Aranda Gutierrez wrote:
>
> Answer to self: use the parser in incremental mode, traverse the read
> buffer and chop it into slices delimited by, but not including, the CRs.
>
> Best, /PA
>
> On Sat, 14 May 2022 at 08:06, Pedro Andres Aranda Gutierrez <
> paag...@gmail.com> wrote:
>
>> OK, just for reference, attached is my MWE. Get the ZIP file from
>> gutenberg.org with
>>
>> wget https://www.gutenberg.org/files/68047/68047-h.zip
>>
>> lxml version 4.8, Python 3.9 on Ubuntu 20.04 or macOS Big Sur
>>
>> Those &#13; are really annoying....
>>
>> Best, /PA
>>
>> On Fri, 13 May 2022 at 12:47, Gilles <codecompl...@free.fr> wrote:
>>
>>> On 12/05/2022 22:32, Adrian Bool wrote:
>>>
>>> On 12 May 2022, at 10:26, Gilles <codecompl...@free.fr> wrote:
>>>
>>>   File "src\lxml\parser.pxi", line 652, in lxml.etree._raiseParseError
>>> OSError: Error reading file* '<html>*
>>>
>>>
>>> Look at the last line above - you're giving parse() a string containing
>>> XML data which the parse() function is treating as a filename; trying to
>>> open a file with a name equivalent to your XML content!
>>>
>>> If you want to parse an XML string - use et.fromstring() instead.
>>>
>>> The StringIO call may be reasonable if your XML didn't exist on disk;
>>> but if your source data is on disk, it is best to either give parse() the
>>> filename (but then you get your &#13; issue) or pass it a file handle
>>> provided by open().
>>>
>>> Sorry I overlooked the last line. I dumbly supposed that parse() could
>>> take either a file handle or a string.
>>> _______________________________________________
>>> lxml - The Python XML Toolkit mailing list -- lxml@python.org
>>> To unsubscribe send an email to lxml-le...@python.org
>>> https://mail.python.org/mailman3/lists/lxml.python.org/
>>> Member address: paag...@gmail.com
>>>
>>
>>
>> --
>> Questions are not there to be answered,
>> questions are there to be asked
>> Georg Kreisler
>>
>> Headaches with a Juju log:
>> unit-basic-16: 09:17:36 WARNING juju.worker.uniter.operation we should
>> run a leader-deposed hook here, but we can't yet
>>
>>
>

