Control: reassign 1010821 pypdf2/2.4.2-1
Control: forwarded 1010821 https://github.com/py-pdf/PyPDF2/issues/1111
Control: retitle 1010821 PyPDF2 fails to read a PDF file with a beginbfchar entry with an empty second element
Control: affects 1010821 + src:xml2rfc src:weasyprint

On Tue 2022-05-10 21:53:30 +0200, Paul Gevers wrote:
> With a recent upload of pypdf2 the autopkgtest of xml2rfc fails in
> testing when that autopkgtest is run with the binary packages of pypdf2
> from unstable. It passes when run with only packages from testing. In
> tabular form:

The problem here is indeed a bug in the latest versions of PyPDF2.

I've traced it back to a failure in how PyPDF2 deals with an empty
second element in a bfchar list:

    https://github.com/py-pdf/PyPDF2/issues/1111

You can replicate the problem with this file (habibi.html):

--------
<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>habibi</title>
</head>
<body>
<div>حَبيبي habibi</div>
</body>
</html>
--------

Feed it through weasyprint from the command line:

------
weasyprint habibi.html habibi.pdf
------

and then in python:

-------
from PyPDF2 import PdfReader
r = PdfReader('habibi.pdf')
t = r.pages[0].extract_text()
-------

This causes a crash in PyPDF2. The crash can be worked around with
this patch:

-----------------
--- a/PyPDF2/_cmap.py
+++ b/PyPDF2/_cmap.py
@@ -245,7 +245,7 @@ def parse_to_unicode(
         elif process_char:
             lst = [x for x in l.split(b" ") if x]
             map_dict[-1] = len(lst[0]) // 2
-            while len(lst) > 0:
+            while len(lst) > 1:
                 map_dict[
                     unhexlify(lst[0]).decode(
                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
-----------------

But the patch is insufficient, because the result of extract_text()
("t" in the python above) is then wrong.

The problem comes from subtly wrong parsing in _cmap.py's
parse_to_unicode(). It manipulates the input by hand: it removes the
angle brackets, then splits and recombines the strings based on
whitespace. When the contents of some of the angle brackets are empty,
the empty element vanishes during the split, so the remaining elements
no longer pair up correctly and this technique doesn't work.

        --dkg
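
P.S. To illustrate the parsing problem described above, here is a
minimal sketch of tokenizing a bfchar line by matching each
angle-bracketed hex string as a unit, so that an empty "<>" survives
as an empty element. This is not PyPDF2's code and not a proposed
fix: parse_bfchar_line() is a hypothetical helper, the sample line is
made up rather than taken from habibi.pdf, and the sketch assumes
2-byte source codes with UTF-16BE destinations and an even number of
hex digits in every string.

```python
import re
from binascii import unhexlify

# Matches one PDF hex string, including the empty string "<>".
HEX_STRING = re.compile(rb"<([0-9A-Fa-f\s]*)>")


def parse_bfchar_line(line: bytes) -> dict:
    """Map source codes to text for one bfchar line.

    Hypothetical helper for illustration only; not part of PyPDF2.
    """
    # Keep every bracketed element, dropping whitespace inside the brackets.
    tokens = [b"".join(m.split()) for m in HEX_STRING.findall(line)]
    if len(tokens) % 2:
        raise ValueError("bfchar line has an unpaired element")
    mapping = {}
    # Elements come in (source code, destination string) pairs.
    for src, dst in zip(tokens[0::2], tokens[1::2]):
        # An empty destination decodes to the empty string.
        mapping[unhexlify(src)] = unhexlify(dst).decode(
            "utf-16-be", "surrogatepass"
        )
    return mapping


# Made-up sample line with an empty second element in the middle pair.
sample = b"<0001> <0628> <0002> <> <0003> <0068>"
result = parse_bfchar_line(sample)
```

With the sample line above, the middle source code maps to an empty
string, whereas stripping the brackets and splitting on whitespace
would leave only five elements to pair up.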