Package: python3-feedparser
Version: 5.2.1-1
Severity: normal
Dear maintainer(s),
The attached script uses feedparser to parse an invalid XHTML document.
If feedparser is installed from PyPI with pip, then the script
exits without error.
If feedparser is installed from the Debian 10 repositories (or from Arch
Linux, I am told), it fails with: "TypeError: startswith first arg must be
bytes or a tuple of bytes, not str" (full traceback attached).
In all cases, feedparser 5.2.1 is used (5.2.1-1 on Debian).
I did not investigate further, but this might be caused by a different
version of sgmllib (one is bundled in Debian's python3-feedparser package).
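To help narrow this down, here is a minimal sketch for checking which files
Python would actually load in each environment. The helper name
module_origin is hypothetical. The module names "sgmllib" and
"feedparser_debian.sgmllib3" are only guesses, the second one taken from the
paths in the attached traceback; either may be absent on a given install, in
which case None is printed.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be loaded from, or None if it cannot be found.

    Hypothetical helper for this report, not part of feedparser.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec.
        return None
    return spec.origin if spec is not None else None


# Module names are assumptions: compare the output of the pip install
# with the output of the Debian package.
for name in ("feedparser", "sgmllib", "feedparser_debian.sgmllib3"):
    print(name, "->", module_origin(name))
```

Running this under both installs and comparing the printed paths should show
whether a different sgmllib implementation is being picked up.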
-- System Information:
Debian Release: 10.2
APT prefers oldstable-debug
APT policy: (500, 'oldstable-debug'), (500, 'stable'), (500,
'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: armhf
Kernel: Linux 4.19.0-6-amd64 (SMP w/4 CPU cores)
Kernel taint flags: TAINT_DIE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8),
LANGUAGE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages python3-feedparser depends on:
ii python3 3.7.3-1
python3-feedparser recommends no packages.
python3-feedparser suggests no packages.
-- no debconf information
import feedparser
data = '''<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom'>
<entry>
<content type='xhtml'><div xmlns='http://www.w3.org/1999/xhtml'>
<p><i></p>
</div></content>
</entry>
<entry>
<content type='xhtml'><div xmlns='http://www.w3.org/1999/xhtml'>
<p>™</p>
</div></content>
</entry>
</feed>
'''
feedparser.parse(data)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 352, in finish_endtag
    method = getattr(self, 'end_' + tag)
AttributeError: '_LooseFeedParser' object has no attribute 'end_content'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "feedparser_invalid_xhtml.py", line 19, in <module>
    feedparser.parse(data)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 3972, in parse
    feedparser.feed(data.decode('utf-8', 'replace'))
  File "/usr/lib/python3/dist-packages/feedparser.py", line 2131, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 98, in feed
    self.goahead(0)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 137, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 314, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 354, in finish_endtag
    self.unknown_endtag(tag)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 704, in unknown_endtag
    method()
  File "/usr/lib/python3/dist-packages/feedparser.py", line 1840, in _end_content
    value = self.popContent('content')
  File "/usr/lib/python3/dist-packages/feedparser.py", line 1011, in popContent
    value = self.pop(tag)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 863, in pop
    if piece.startswith('</'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
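
The last frame suggests that at least one collected piece is a bytes object
rather than a str when pop() runs. The sketch below only reproduces that kind
of failure in isolation. It is not feedparser's code, the helper name
starts_with_close_tag is hypothetical, and it does not show where the bytes
value comes from.

```python
def starts_with_close_tag(piece):
    """Mimic the check shown in the traceback: piece.startswith('</')."""
    return piece.startswith('</')


# A str piece is handled normally.
print(starts_with_close_tag('</p>'))

# A bytes piece makes bytes.startswith() reject the str argument.
try:
    starts_with_close_tag(b'</p>')
except TypeError as exc:
    print("TypeError:", exc)
```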