Daniel Persson created PDFBOX-6065:
--------------------------------------
Summary: LZWFilter crashes, probably not handling the KwKwK
special case
Key: PDFBOX-6065
URL: https://issues.apache.org/jira/browse/PDFBOX-6065
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 3.0.5 PDFBox
Reporter: Daniel Persson
Attachments: elvis5.pdf, lzwfilter.patch
Parsing throws an exception when decoding an image that contains the words
"The Legend" in the PDF:
java.io.IOException: negative array index: -1 near offset 1
at org.apache.pdfbox.filter.LZWFilter.checkIndexBounds(LZWFilter.java:136)
at org.apache.pdfbox.filter.LZWFilter.doLZWDecode(LZWFilter.java:110)
at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:70)
I haven't looked into the Lempel-Ziv algorithm since the 90s, so I'm not up to
date with all the papers that have been published since. I've also never read
the original Welch paper:
[https://courses.cs.duke.edu/spring03/cps296.5/papers/welch_1984_technique_for.pdf]
However, ChatGPT was apparently able to find this paper and suggest a patch
that rewrites the function to handle all cases, so the bounds check is no
longer needed. I'm not saying that this is the right solution to the problem,
but I ran it against our 50k pages from multiple publishers and newspapers
without any visual artifacts, and it also works with the example provided in
this issue.
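
For readers unfamiliar with the KwKwK case, here is a minimal sketch of it. This is neither the attached lzwfilter.patch nor PDFBox's LZWFilter code. The class name LzwKwKwKSketch and its helpers are hypothetical, and the sketch assumes the codes have already been unpacked from the bit stream (the 9 to 12 bit code widths and the EarlyChange parameter are left out). The special case occurs when the decoder receives a code that is exactly one past the end of its table: the encoder has just created that entry, and it must be the previous string followed by its own first byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LzwKwKwKSketch {
    static final int CLEAR_TABLE = 256; // PDF LZW clear-table code
    static final int EOD = 257;         // PDF LZW end-of-data code
    static final int MAX_ENTRIES = 4096; // 12-bit code limit

    // Decodes already-unpacked LZW codes (assumption: bit unpacking is done elsewhere).
    static byte[] decode(int[] codes) {
        List<byte[]> table = new ArrayList<>();
        resetTable(table);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prev = null;
        for (int code : codes) {
            if (code == CLEAR_TABLE) {
                resetTable(table);
                prev = null;
                continue;
            }
            if (code == EOD) {
                break;
            }
            byte[] entry;
            if (code >= 0 && code < table.size()) {
                // Normal case: the code is already in the table.
                entry = table.get(code);
            } else if (code == table.size() && prev != null) {
                // KwKwK case: the code refers to the entry the encoder just
                // created, which is the previous string plus its first byte.
                entry = append(prev, prev[0]);
            } else {
                throw new IllegalArgumentException("invalid LZW code: " + code);
            }
            out.write(entry, 0, entry.length);
            if (prev != null && table.size() < MAX_ENTRIES) {
                // New entry: previous string plus first byte of the current one.
                table.add(append(prev, entry[0]));
            }
            prev = entry;
        }
        return out.toByteArray();
    }

    // Fills the table with the 256 single-byte strings plus two placeholders
    // for the clear-table and end-of-data codes.
    private static void resetTable(List<byte[]> table) {
        table.clear();
        for (int i = 0; i < 256; i++) {
            table.add(new byte[] {(byte) i});
        }
        table.add(new byte[0]); // 256
        table.add(new byte[0]); // 257
    }

    private static byte[] append(byte[] prefix, byte last) {
        byte[] result = Arrays.copyOf(prefix, prefix.length + 1);
        result[prefix.length] = last;
        return result;
    }

    public static void main(String[] args) {
        // "ABABABA" encodes to 65, 66, 258, 260; code 260 arrives before the
        // decoder has added it to the table, which triggers the KwKwK branch.
        int[] codes = {CLEAR_TABLE, 65, 66, 258, 260, EOD};
        System.out.println(new String(decode(codes), StandardCharsets.US_ASCII)); // prints ABABABA
    }
}
```

In this sketch a code beyond table.size() is still rejected as corrupt input, so some form of bounds check remains; only the code equal to table.size() is treated as the special case.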