Gábor Stefanik created PDFBOX-5879:
--------------------------------------

             Summary: Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
                 Key: PDFBOX-5879
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5879
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.3 PDFBox
            Reporter: Gábor Stefanik
         Attachments: MVM_Aram_augusztus.pdf

{code:java}
java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf"
{code}
fails with the following error:
{code:java}
java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
        at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
        at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
        at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
        at picocli.CommandLine.execute(CommandLine.java:2174)
        at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
{code}
The same command succeeds in 3.0.2.

The triggering PDF can be downloaded from
[https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
and is also attached.

The root cause appears to be this change from PDFBOX-5841:
[https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
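
To illustrate the kind of failure the stack trace suggests, here is a minimal, self-contained sketch. It does not use PDFBox and is not the code from the linked commit. The types StreamObj, ArrayObj and RefObj are hypothetical stand-ins for a content stream, an array, and an indirect object reference. The assumption, not confirmed by this report, is that the page's contents entry or its array elements are reached through indirect references, so an unchecked cast to the array type fails unless the reference is resolved first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; these are NOT the PDFBox COS classes.
class ContentsCastSketch {
    interface Base {}

    // Stand-in for a single content stream.
    static final class StreamObj implements Base {
        final String data;
        StreamObj(String data) { this.data = data; }
    }

    // Stand-in for an array of content streams.
    static final class ArrayObj implements Base {
        final List<Base> items = new ArrayList<>();
        ArrayObj add(Base item) { items.add(item); return this; }
    }

    // Stand-in for an indirect reference to another object.
    static final class RefObj implements Base {
        final Base target;
        RefObj(Base target) { this.target = target; }
    }

    // Follow indirect references until a direct object is reached.
    static Base resolve(Base b) {
        while (b instanceof RefObj) {
            b = ((RefObj) b).target;
        }
        return b;
    }

    // Collect streams whether the entry is one stream or an array of streams,
    // resolving references at both levels instead of casting blindly.
    static List<StreamObj> contentStreams(Base contents) {
        List<StreamObj> out = new ArrayList<>();
        Base resolved = resolve(contents);
        if (resolved instanceof StreamObj) {
            out.add((StreamObj) resolved);
        } else if (resolved instanceof ArrayObj) {
            for (Base item : ((ArrayObj) resolved).items) {
                Base r = resolve(item);
                if (r instanceof StreamObj) {
                    out.add((StreamObj) r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A page with two content streams, everything behind references.
        Base contents = new RefObj(new ArrayObj()
                .add(new RefObj(new StreamObj("q")))
                .add(new RefObj(new StreamObj("Q"))));
        try {
            ArrayObj direct = (ArrayObj) contents; // unchecked cast of a reference
            System.out.println("cast succeeded: " + direct.items.size());
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed");
        }
        System.out.println(contentStreams(contents).size());
    }
}
```

In this sketch the blind cast throws because the value is a reference rather than the array itself, while the resolving variant returns both streams. Whether the actual fix in PDFBox takes this shape is an open question for the maintainers.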



