Daniel Persson created PDFBOX-6065:
--------------------------------------

             Summary: LZWFilter crashes, probably not handling the KwKwK special case
                 Key: PDFBOX-6065
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6065
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 3.0.5 PDFBox
            Reporter: Daniel Persson
         Attachments: elvis5.pdf, lzwfilter.patch

Parsing throws an exception when decoding the image that contains the words 
"The Legend" in the attached PDF.

java.io.IOException: negative array index: -1 near offset 1
    at org.apache.pdfbox.filter.LZWFilter.checkIndexBounds(LZWFilter.java:136)
    at org.apache.pdfbox.filter.LZWFilter.doLZWDecode(LZWFilter.java:110)
    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:70)

 

I haven't looked into the Lempel-Ziv algorithm since the 90s, so I'm not up to 
date with the papers that have been published since then. I've also never read 
the original Welch paper:

[https://courses.cs.duke.edu/spring03/cps296.5/papers/welch_1984_technique_for.pdf]

But it seems that ChatGPT was able to find this paper and suggest a patch that 
rewrites the function to handle all cases, so the bounds check is no longer 
needed. I'm not saying that this is the right solution to the problem, but I ran 
it against our 50k pages from multiple publishers and newspapers without any 
visual artifacts, and it also works with the example attached to this issue.
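
For readers unfamiliar with the KwKwK case named in the summary: it occurs when 
the decoder receives a code equal to the next free table index, i.e. a code it 
has not defined yet. The string for that code is the previous string plus the 
previous string's first byte. The following is only a minimal sketch of that 
rule, not the attached lzwfilter.patch and not PDFBox's LZWFilter. It assumes 
the codes are already unpacked into ints and leaves out the clear-table and EOD 
codes and the variable code width; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of PDFBox.
final class LzwKwKwKSketch
{
    private LzwKwKwKSketch()
    {
    }

    // Decodes already-unpacked LZW codes (assumption: no clear/EOD codes,
    // table starts with the 256 single-byte entries, next free code is 256).
    static byte[] decode(int[] codes)
    {
        List<byte[]> table = new ArrayList<>();
        for (int i = 0; i < 256; i++)
        {
            table.add(new byte[] { (byte) i });
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] previous = null;
        for (int code : codes)
        {
            byte[] entry;
            if (code >= 0 && code < table.size())
            {
                // Normal case: the code is already in the table.
                entry = table.get(code);
            }
            else if (code == table.size() && previous != null)
            {
                // KwKwK case: the code is the one about to be defined, so the
                // entry is the previous string plus its own first byte.
                entry = append(previous, previous[0]);
            }
            else
            {
                throw new IllegalArgumentException("invalid LZW code: " + code);
            }
            out.write(entry, 0, entry.length);
            if (previous != null)
            {
                // Define the next code: previous string + first byte of entry.
                table.add(append(previous, entry[0]));
            }
            previous = entry;
        }
        return out.toByteArray();
    }

    // Returns a copy of prefix with one extra byte appended.
    private static byte[] append(byte[] prefix, byte last)
    {
        byte[] result = Arrays.copyOf(prefix, prefix.length + 1);
        result[prefix.length] = last;
        return result;
    }
}
```

With these assumptions, the input "ABABABA" is encoded as 65, 66, 256, 258, and 
code 258 reaches the decoder before the decoder has added it to its table, so 
decoding only succeeds through the KwKwK branch.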



--
This message was sent by Atlassian Jira
(v8.20.10#820010)