Control: reassign 1010821 pypdf2/2.4.2-1
Control: forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/1111
Control: retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element
Control: affects 1010821 + src:xml2rfc src:weasyprint

On Tue 2022-05-10 21:53:30 +0200, Paul Gevers wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in 
> testing when that autopkgtest is run with the binary packages of pypdf2 
> from unstable. It passes when run with only packages from testing. In 
> tabular form:

The problem here is indeed a bug in the latest versions of PyPDF2.  I've
traced it back to how PyPDF2 handles an empty second element in a
bfchar list:

   https://github.com/py-pdf/PyPDF2/issues/1111

You can replicate the problem with this file (habibi.html):

--------
<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>habibi</title>
</head>
<body>
<div>حَبيبي habibi</div>
</body>
</html>
--------

Feed it through weasyprint from the command line:

------
weasyprint habibi.html habibi.pdf
------

and then in Python:

-------
from PyPDF2 import PdfReader
r = PdfReader('habibi.pdf')
t = r.pages[0].extract_text()
-------

This causes a crash in PyPDF2.  The crash can be worked around with this
patch:

-----------------
--- a/PyPDF2/_cmap.py
+++ b/PyPDF2/_cmap.py
@@ -245,7 +245,7 @@ def parse_to_unicode(
         elif process_char:
             lst = [x for x in l.split(b" ") if x]
             map_dict[-1] = len(lst[0]) // 2
-            while len(lst) > 0:
+            while len(lst) > 1:
                 map_dict[
                     unhexlify(lst[0]).decode(
                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
-----------------

But the patch is insufficient, because the result of extract_text()
("t" in the Python above) is then wrong.  The underlying problem is
subtly wrong parsing in _cmap.py's parse_to_unicode().  It manipulates
the data by hand, removing angle brackets and then splitting and
recombining strings based on whitespace.  When the contents of some of
the angle brackets are empty, this technique doesn't work.
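
To illustrate why the one-line patch avoids the crash but still yields
wrong text, here is a minimal sketch.  It is not PyPDF2's actual code:
naive_tokens(), bracket_tokens() and to_mapping() are hypothetical
helpers, and the sample bfchar line is made up for the example.  The
naive version drops the angle brackets and splits on whitespace, so an
empty <> leaves no token behind and every later element shifts by one.
Tokenizing on the brackets themselves (one possible approach, not
necessarily what upstream should adopt) keeps the empty element in
place.

```python
import re
from binascii import unhexlify


def naive_tokens(line: bytes) -> list:
    # Simplified stand-in for "remove angle brackets, split on whitespace".
    stripped = line.replace(b"<", b" ").replace(b">", b" ")
    return [x for x in stripped.split(b" ") if x]


def bracket_tokens(line: bytes) -> list:
    # Treat each <...> as one token, so an empty <> yields b"".
    return [re.sub(rb"\s+", b"", t) for t in re.findall(rb"<([^<>]*)>", line)]


def to_mapping(tokens: list) -> dict:
    # Pair tokens as (source code, destination text).  zip() silently
    # drops a trailing unpaired token, much like "while len(lst) > 1".
    mapping = {}
    for src, dst in zip(tokens[0::2], tokens[1::2]):
        mapping[int(src, 16)] = unhexlify(dst).decode("utf-16-be", "surrogatepass")
    return mapping


# Hypothetical bfchar entries; the second one has an empty destination.
line = b"<0001> <0062> <0002> <> <0003> <0069>"

print(naive_tokens(line))               # five tokens: the empty one is gone
print(to_mapping(naive_tokens(line)))   # code 2 is paired with the wrong value
print(to_mapping(bracket_tokens(line))) # code 2 maps to the empty string
```

With the naive tokens, code 0002 ends up mapped to the bytes that were
meant to be the next source code, and code 0003 is lost entirely.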

    --dkg
