[ https://issues.apache.org/jira/browse/PDFBOX-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018959#comment-18018959 ]

Tilman Hausherr edited comment on PDFBOX-6065 at 9/9/25 6:50 AM:
-----------------------------------------------------------------

Thank you, it also works with my test files and improves the rendering of the files from PDFBOX-2989. Amusingly, I once asked ChatGPT (shortly after it came out) to optimize the LZW code, and the resulting code failed the unit test. So ChatGPT has evolved a lot.

was (Author: tilman):

Thank you, it also works with my test files. Amusingly, I once asked ChatGPT (shortly after it came out) to optimize the LZW code, and the resulting code failed the unit test. So ChatGPT has evolved a lot.

> LZWFilter crashes, probably not handling the KwKwK special case
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-6065
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6065
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.34, 3.0.5 PDFBox
>            Reporter: Daniel Persson
>            Priority: Minor
>             Fix For: 2.0.35, 3.0.6 PDFBox, 4.0.0
>
>         Attachments: elvis5.pdf, lzwfilter.patch
>
>
> Parsing throws an exception when trying to decode an image with the words
> "The Legend" in the PDF.
> java.io.IOException: negative array index: -1 near offset 1
>     at org.apache.pdfbox.filter.LZWFilter.checkIndexBounds(LZWFilter.java:136)
>     at org.apache.pdfbox.filter.LZWFilter.doLZWDecode(LZWFilter.java:110)
>     at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:70)
>
> I've not looked into the Lempel-Ziv algorithm since the 90s, so I'm not up to
> date with all the papers that have been published. Also, I've never read
> the original Welch paper:
> [https://courses.cs.duke.edu/spring03/cps296.5/papers/welch_1984_technique_for.pdf]
> But it seems that ChatGPT was able to find this paper and suggest a patch that
> rewrites the function to handle all cases, so the bounds check is not needed
> at all.
> Not saying that this is the right solution to the problem, but I ran it
> against our 50k pages from multiple publishers and newspapers without any
> visual artifacts, and it also works with the example provided in this issue.
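
For readers unfamiliar with the KwKwK special case named in the issue title, here is a minimal sketch of how an LZW decoder can handle it. This is not the attached lzwfilter.patch and not the PDFBox LZWFilter code. The class name LzwSketch and its methods are hypothetical. As a simplifying assumption, it works on already-unpacked integer codes and leaves out the clear-table and end-of-data codes and the variable code widths that the PDF LZW filter uses. The point it illustrates: a code equal to the next free table index refers to an entry the decoder has not stored yet, and that entry must be the previous string followed by its own first byte.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified LZW decoder (illustration only, not PDFBox code).
final class LzwSketch {

    // Decodes unpacked LZW codes. The table starts with the 256 single-byte entries;
    // there are no clear-table or end-of-data codes in this sketch.
    static byte[] decode(int[] codes) {
        List<byte[]> table = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            table.add(new byte[] {(byte) i});
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prev = null;
        for (int code : codes) {
            byte[] entry;
            if (code >= 0 && code < table.size()) {
                // Ordinary case: the code is already in the table.
                entry = table.get(code);
            } else if (code == table.size() && prev != null) {
                // KwKwK case: the encoder used the entry it had just created,
                // so it must be the previous string plus that string's first byte.
                entry = append(prev, prev[0]);
            } else {
                throw new IllegalArgumentException("invalid LZW code: " + code);
            }
            out.write(entry, 0, entry.length);
            if (prev != null) {
                // New table entry: previous string plus first byte of the current one.
                table.add(append(prev, entry[0]));
            }
            prev = entry;
        }
        return out.toByteArray();
    }

    private static byte[] append(byte[] prefix, byte last) {
        byte[] result = Arrays.copyOf(prefix, prefix.length + 1);
        result[prefix.length] = last;
        return result;
    }

    public static void main(String[] args) {
        // "ABABABA" encodes to 65, 66, 256, 258 with this table layout;
        // 258 arrives before the decoder has stored entry 258 (KwKwK).
        byte[] decoded = decode(new int[] {65, 66, 256, 258});
        System.out.println(new String(decoded, java.nio.charset.StandardCharsets.US_ASCII)); // prints ABABABA
    }
}
```

In this sketch, a code larger than the next free index is rejected with an exception instead of being used as an array index, which is one way to make a separate bounds check unnecessary.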