Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1

2022-07-19 Thread GCS
Version: 2.6.0-1

On Tue, May 10, 2022 at 9:57 PM Paul Gevers  wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable.
 This was fixed in the recent pypdf2 upload, but I forgot to close this
bug report.

Laszlo/GCS



Processed (with 1 error): Re: Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1

2022-07-15 Thread Debian Bug Tracking System
Processing control commands:

> reassign 1010821 pypdf2/2.4.2-1
Unknown command or malformed arguments to command.

> forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/
Bug #1010821 [src:xml2rfc] pypdf2 breaks xml2rfc autopkgtest: 
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Set Bug forwarded-to-address to 'https://github.com/py-pdf/PyPDF2/issues/'.
> retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar entry with 
> an empty second element
Bug #1010821 [src:xml2rfc] pypdf2 breaks xml2rfc autopkgtest: 
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1
Changed Bug title to 'PyPDF2 fails to read a PDF file with a beginbfchar entry 
with an empty second element' from 'pypdf2 breaks xml2rfc autopkgtest: 
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1'.
> affects 1010821 + src:xml2rfc src:weasyprint
Bug #1010821 [src:xml2rfc] PyPDF2 fails to read a PDF file with a beginbfchar 
entry with an empty second element
Added indication that 1010821 affects src:xml2rfc and src:weasyprint

-- 
1010821: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1010821
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1

2022-07-15 Thread Daniel Kahn Gillmor
Control: reassign 1010821 pypdf2/2.4.2-1
Control: forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/
Control: retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar 
entry with an empty second element
Control: affects 1010821 + src:xml2rfc src:weasyprint

On Tue 2022-05-10 21:53:30 +0200, Paul Gevers wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in 
> testing when that autopkgtest is run with the binary packages of pypdf2 
> from unstable. It passes when run with only packages from testing. In 
> tabular form:

The problem here is indeed a bug in the latest versions of PyPDF2.  I've
traced it back to a failure in how PyPDF2 deals with an empty second
element in a bfchar list:

   https://github.com/py-pdf/PyPDF2/issues/

You can replicate the problem with this file (habibi.html):





habibi


حَبيبي habibi




Feed it through weasyprint from the command line:

--
weasyprint habibi.html habibi.pdf
--

and then in python:

---
from PyPDF2 import PdfReader
r = PdfReader('habibi.pdf')
t = r.pages[0].extract_text()
---

This causes a crash in PyPDF2.  The crash can be worked around with this
patch:

-
--- a/PyPDF2/_cmap.py
+++ b/PyPDF2/_cmap.py
@@ -245,7 +245,7 @@ def parse_to_unicode(
 elif process_char:
 lst = [x for x in l.split(b" ") if x]
 map_dict[-1] = len(lst[0]) // 2
-while len(lst) > 0:
+while len(lst) > 1:
 map_dict[
 unhexlify(lst[0]).decode(
 "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
-

But the patch is insufficient, because then the result of extract_text()
("t" in the python above) is wrong.  The problem has to do with subtly
wrong parsing in _cmap.py's parse_to_unicode().  It does manual
manipulation by removing angle brackets and then splitting and
recombining strings based on whitespace.  When the contents of some of
the angle-brackets are empty, this technique doesn't work.
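
To illustrate the failure mode described above, here is a minimal sketch.
It is a simplified model of the strip-and-split approach, not PyPDF2's
actual code; the helper names (naive_tokens, bracket_tokens, to_pairs,
decode_map) and the sample bfchar line are made up for the example.  It
compares splitting on whitespace after discarding the angle brackets with
a tokenizer that keeps each <...> group as one token, empty or not:

```python
import re
from binascii import unhexlify

# Hypothetical bfchar line: the first destination is empty (<>).
LINE = b"<0011> <> <0012> <0041>"


def naive_tokens(line: bytes) -> list:
    """Drop the angle brackets, then split on spaces (simplified model)."""
    stripped = line.replace(b"<", b" ").replace(b">", b" ")
    return [t for t in stripped.split(b" ") if t]


def bracket_tokens(line: bytes) -> list:
    """Treat every <...> group as one token, so empty groups survive."""
    return re.findall(rb"<([0-9A-Fa-f]*)>", line)


def to_pairs(tokens: list) -> list:
    """Group tokens as (source, destination); a trailing odd token is dropped."""
    return list(zip(tokens[0::2], tokens[1::2]))


def decode_map(pairs: list) -> dict:
    """Map the source code to its UTF-16-BE decoded destination string."""
    return {int(src, 16): unhexlify(dst).decode("utf-16-be") for src, dst in pairs}


# The empty group vanishes, so the pairs shift: 0x11 is mapped to U+0012.
print(naive_tokens(LINE), decode_map(to_pairs(naive_tokens(LINE))))
# The empty group is kept, so 0x11 maps to "" and 0x12 maps to "A".
print(bracket_tokens(LINE), decode_map(to_pairs(bracket_tokens(LINE))))
```

In this model the naive tokenizer yields an odd number of tokens, which
is consistent with a loop that reads two elements per iteration running
off the end of the list.  Guarding the loop avoids the crash but leaves
the pairs misaligned, so the extracted text is still wrong.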

--dkg




Bug#1010821: pypdf2 breaks xml2rfc autopkgtest: lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1

2022-05-10 Thread Paul Gevers

Source: pypdf2, xml2rfc
Control: found -1 pypdf2/1.27.12-1
Control: found -1 xml2rfc/3.12.4-1
Severity: serious
Tags: sid bookworm
User: debian...@lists.debian.org
Usertags: breaks needs-update

Dear maintainer(s),

With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in 
testing when that autopkgtest is run with the binary packages of pypdf2 
from unstable. It passes when run with only packages from testing. In 
tabular form:


                       pass            fail
pypdf2                 from testing    1.27.12-1
xml2rfc                from testing    3.12.4-1
all others             from testing    from testing

I copied some of the output at the bottom of this report.

Currently this regression is blocking the migration of pypdf2 to testing 
[1]. Due to the nature of this issue, I filed this bug report against 
both packages. Can you please investigate the situation and reassign the 
bug to the right package?


More information about this bug and the reason for filing it can be found on
https://wiki.debian.org/ContinuousIntegration/RegressionEmailInformation

Paul

[1] https://qa.debian.org/excuses.php?package=pypdf2

https://ci.debian.net/data/autopkgtest/testing/amd64/x/xml2rfc/21504535/log.gz

==
ERROR: setUpClass (__main__.PdfWriterTests)
--
Traceback (most recent call last):
  File "/tmp/autopkgtest-lxc.mlxdmdjo/downtmp/build.EDj/src/xxx/test.py", line 495, in setUpClass
    cls.elements_pdfxml = xmldoc(None, bytes=elements_pdfdoc)
  File "/usr/lib/python3/dist-packages/xml2rfc/walkpdf.py", line 97, in xmldoc
    return lxml.etree.fromstring(text)
  File "src/lxml/etree.pyx", line 3252, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "", line 11931
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1, line 11931, column 5

--
Ran 42 tests in 32.420s

FAILED (errors=1)
autopkgtest [04:57:54]: test run-pytest
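
For context on the error in the log above: XML 1.0 does not allow the
control character U+0001 in character data, so a parser rejects any
document that contains it.  The sketch below shows this with Python's
standard-library parser rather than lxml (an assumption made so the
example has no dependencies; the exception type and message differ from
lxml's).  The helper names are made up for the example, and stripping the
characters only hides the symptom; it does not fix the text extraction.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_well_formed(doc: str) -> bool:
    """Return True if the stdlib parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that XML 1.0 does not permit in a document."""
    return _INVALID_XML_CHARS.sub("", text)


extracted = "page text\x01more text"  # hypothetical extracted text with U+0001
print(is_well_formed("<p>" + extracted + "</p>"))
print(is_well_formed("<p>" + strip_invalid_xml_chars(extracted) + "</p>"))
```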


