Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Version: 2.6.0-1

On Tue, May 10, 2022 at 9:57 PM Paul Gevers wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable.

This was fixed in the recent pypdf2 upload, but I forgot to close this
bug report.

Laszlo/GCS
Processed (with 1 error): Re: Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Processing control commands:

> reassign 1010821 pypdf2/2.4.2-1
Unknown command or malformed arguments to command.

> forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/
Bug #1010821 [src:xml2rfc] pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Set Bug forwarded-to-address to 'https://github.com/py-pdf/PyPDF2/issues/'.

> retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar entry with
> an empty second element
Bug #1010821 [src:xml2rfc] pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Changed Bug title to 'PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element' from 'pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1'.

> affects 1010821 + src:xml2rfc src:weasyprint
Bug #1010821 [src:xml2rfc] PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element
Added indication that 1010821 affects src:xml2rfc and src:weasyprint

--
1010821: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1010821
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Control: reassign 1010821 pypdf2/2.4.2-1
Control: forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/
Control: retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element
Control: affects 1010821 + src:xml2rfc src:weasyprint

On Tue 2022-05-10 21:53:30 +0200, Paul Gevers wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable. It passes when run with only packages from testing. In
> tabular form:

The problem here is indeed a bug in the latest versions of PyPDF2.

I've traced it back to a failure in how PyPDF2 deals with an empty
second element in a bfchar list:

https://github.com/py-pdf/PyPDF2/issues/

You can replicate the problem with this file (habibi.html):

habibi حَبيبي habibi

Feed it through weasyprint from the command line:

--
weasyprint habibi.html habibi.pdf
--

and then in python:

---
from PyPDF2 import PdfReader
r = PdfReader('habibi.pdf')
t = r.pages[0].extract_text()
---

This causes a crash in PyPDF2.

The crash can be worked around with this patch:

--- a/PyPDF2/_cmap.py
+++ b/PyPDF2/_cmap.py
@@ -245,7 +245,7 @@ def parse_to_unicode(
         elif process_char:
             lst = [x for x in l.split(b" ") if x]
             map_dict[-1] = len(lst[0]) // 2
-            while len(lst) > 0:
+            while len(lst) > 1:
                 map_dict[
                     unhexlify(lst[0]).decode(
                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"

But the patch is insufficient, because then the result of
extract_text() ("t" in the python above) is wrong.

The problem has to do with subtly wrong parsing in _cmap.py's
parse_to_unicode(). It manipulates the data manually, removing angle
brackets and then splitting and recombining strings based on
whitespace. When the contents of some of the angle brackets are empty,
this technique doesn't work.

--dkg

signature.asc Description: PGP signature
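The parsing problem described above can be illustrated with a minimal sketch. This is not PyPDF2's actual code: the sample bfchar line and the helper names (naive_pairs, bracket_aware_pairs, to_map) are hypothetical, and the sketch assumes two-byte source codes with UTF-16-BE destinations. It only shows why dropping the angle brackets and splitting on whitespace loses an empty second element, while tokenizing on the brackets themselves keeps it.

```python
import re
from binascii import unhexlify

# Hypothetical bfchar content: the code <0011> maps to an empty destination.
line = b"<0003> <0020> <0011> <> <0012> <0041>"


def naive_pairs(data):
    """Drop angle brackets, split on whitespace, then pair up the tokens.

    The empty <> leaves no token behind, so every later pair is shifted.
    """
    stripped = data.replace(b"<", b" ").replace(b">", b" ")
    tokens = [x for x in stripped.split(b" ") if x]
    return list(zip(tokens[0::2], tokens[1::2]))


def bracket_aware_pairs(data):
    """Tokenize on <...> groups so an empty group survives as b''."""
    tokens = re.findall(rb"<([0-9A-Fa-f]*)>", data)
    return list(zip(tokens[0::2], tokens[1::2]))


def to_map(pairs):
    """Decode (source, destination) hex pairs into a bytes -> str mapping."""
    return {unhexlify(src): unhexlify(dst).decode("utf-16-be") for src, dst in pairs}


print(to_map(naive_pairs(line)))          # <0011> wrongly maps to U+0012
print(to_map(bracket_aware_pairs(line)))  # <0011> maps to "", <0012> to "A"
```

With the naive split, the source code <0011> ends up paired with the next source code instead of the empty string, and the final mapping is dropped. That matches the report that avoiding the crash alone still yields wrong extracted text.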
Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Source: pypdf2, xml2rfc
Control: found -1 pypdf2/1.27.12-1
Control: found -1 xml2rfc/3.12.4-1
Severity: serious
Tags: sid bookworm
User: debian...@lists.debian.org
Usertags: breaks needs-update

Dear maintainer(s),

With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
testing when that autopkgtest is run with the binary packages of pypdf2
from unstable. It passes when run with only packages from testing. In
tabular form:

                       pass            fail
pypdf2                 from testing    1.27.12-1
xml2rfc                from testing    3.12.4-1
all others             from testing    from testing

I copied some of the output at the bottom of this report.

Currently this regression is blocking the migration of pypdf2 to testing
[1]. Due to the nature of this issue, I filed this bug report against
both packages. Can you please investigate the situation and reassign the
bug to the right package?

More information about this bug and the reason for filing it can be found on
https://wiki.debian.org/ContinuousIntegration/RegressionEmailInformation

Paul

[1] https://qa.debian.org/excuses.php?package=pypdf2

https://ci.debian.net/data/autopkgtest/testing/amd64/x/xml2rfc/21504535/log.gz

==
ERROR: setUpClass (__main__.PdfWriterTests)
--
Traceback (most recent call last):
  File "/tmp/autopkgtest-lxc.mlxdmdjo/downtmp/build.EDj/src/xxx/test.py", line 495, in setUpClass
    cls.elements_pdfxml = xmldoc(None, bytes=elements_pdfdoc)
  File "/usr/lib/python3/dist-packages/xml2rfc/walkpdf.py", line 97, in xmldoc
    return lxml.etree.fromstring(text)
  File "src/lxml/etree.pyx", line 3252, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "", line 11931
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1, line 11931, column 5

--
Ran 42 tests in 32.420s

FAILED (errors=1)
autopkgtest [04:57:54]: test run-pytest

OpenPGP_signature Description: OpenPGP digital signature