[ 
https://issues.apache.org/jira/browse/PDFBOX-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407585#comment-16407585
 ] 

Andreas Hubold edited comment on PDFBOX-4162 at 3/21/18 8:26 AM:
-----------------------------------------------------------------

I'm getting an exception with the fix now. For test purposes I've cherry-picked 
your commit onto 2.0.8, but I guess this should not make a difference for the 
error. Just note that the line numbers in the exception may not match the state 
of your branch:
{noformat}
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSArray cannot be cast to org.apache.pdfbox.cos.COSNumber
    at org.apache.pdfbox.cos.COSArray.toFloatArray(COSArray.java:540)
    at org.apache.pdfbox.pdmodel.graphics.PDLineDashPattern.<init>(PDLineDashPattern.java:55)
    at org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState.getLineDashPattern(PDExtendedGraphicsState.java:284)
    at org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState.copyIntoGraphicsState(PDExtendedGraphicsState.java:89)
    at org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters.process(SetGraphicsStateParameters.java:61)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
{noformat}
In the debugger I could see that the value of 'dp' is a COSArray with two 
elements: the first is an empty COSArray, the second a COSInt with value 0. 
The nested array is what triggers the ClassCastException in 
COSArray.toFloatArray, which casts each element to COSNumber.
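
The value seen in the debugger matches the [dashArray dashPhase] form that the PDF specification gives for the D entry of a graphics state parameter dictionary, so the pair has to be unpacked before the dash lengths are converted to floats. The following is a minimal sketch of that idea, not PDFBox code: it models COSArray as java.util.List and COSNumber as java.lang.Number, and the names DashPatternSketch and dashLengths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DashPatternSketch {

    /**
     * Extracts the dash lengths from a [dashArray, dashPhase] pair.
     * Returns null if the value does not have that shape.
     * (List stands in for COSArray, Number for COSNumber; both are assumptions.)
     */
    static float[] dashLengths(Object d) {
        if (!(d instanceof List)) {
            return null;
        }
        List<?> pair = (List<?>) d;
        if (pair.size() != 2
                || !(pair.get(0) instanceof List)
                || !(pair.get(1) instanceof Number)) {
            return null;
        }
        List<?> dashArray = (List<?>) pair.get(0);
        float[] result = new float[dashArray.size()];
        for (int i = 0; i < result.length; i++) {
            Object element = dashArray.get(i);
            if (!(element instanceof Number)) {
                return null; // check instead of casting blindly
            }
            result[i] = ((Number) element).floatValue();
        }
        return result;
    }

    public static void main(String[] args) {
        // Same shape as the value from the debugger: [[], 0]
        List<Object> dp = Arrays.asList(new ArrayList<Object>(), 0);
        System.out.println(dashLengths(dp).length); // prints 0
    }
}
```

Converting the outer pair directly, as the stack trace shows, fails on the first element because it is itself an array; checking the element types first avoids the cast failure.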


> OutOfMemoryError in PDExtendedGraphicsState#getLineDashPattern
> --------------------------------------------------------------
>
>                 Key: PDFBOX-4162
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4162
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.8
>            Reporter: Andreas Hubold
>            Assignee: Andreas Lehmkühler
>            Priority: Critical
>
> I'm getting an OutOfMemoryError from PDFBox when parsing a certain PDF using 
> the Apache Tika App v 1.17 - which uses PDFBox 2.0.8 internally. This is 
> reproducible even with 8GB heap. 
>  
> The OutOfMemoryError happens in 
> org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState#getLineDashPattern,
>  which contains this piece of suspicious code: 
> {code:java}
> COSArray dp = (COSArray) dict.getDictionaryObject( COSName.D );
> if( dp != null )
> {
>     COSArray array = new COSArray();
>     dp.addAll(dp);
> {code}
> The last line is wrong. It appends all elements of 'dp' to 'dp' itself, 
> doubling the number of elements in the list. Maybe the intention was to 
> add them to the newly created 'array' instead.
>  
> Stacktrace: 
> {noformat}
> [Full GC (Allocation Failure)  4225609K->4224664K(5989888K), 32,9544686 secs]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:3210)
>     at java.util.Arrays.copyOf(Arrays.java:3181)
>     at java.util.ArrayList.grow(ArrayList.java:261)
>     at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
>     at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
>     at java.util.ArrayList.addAll(ArrayList.java:579)
>     at org.apache.pdfbox.cos.COSArray.addAll(COSArray.java:124)
>     at org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState.getLineDashPattern(PDExtendedGraphicsState.java:280)
>     at org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState.copyIntoGraphicsState(PDExtendedGraphicsState.java:89)
>     at org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters.process(SetGraphicsStateParameters.java:61)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> {noformat}
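
The doubling described in the issue can be reproduced without PDFBox. This is a minimal sketch, assuming the getter is called repeatedly on the same underlying array (for example once per graphics state operator). It uses a plain java.util.ArrayList in place of COSArray, which according to the stack trace delegates addAll to ArrayList; the names SelfAppendDemo, buggyGetter and copyingGetter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SelfAppendDemo {

    // Mimics the reported line: the list is appended to itself,
    // so its size doubles on every call.
    static void buggyGetter(List<Integer> dp) {
        dp.addAll(dp);
    }

    // What the description suggests was probably intended:
    // copy the elements into a new list and leave the source untouched.
    static List<Integer> copyingGetter(List<Integer> dp) {
        List<Integer> array = new ArrayList<>();
        array.addAll(dp);
        return array;
    }

    public static void main(String[] args) {
        List<Integer> dp = new ArrayList<>(Arrays.asList(3, 1));
        for (int call = 1; call <= 10; call++) {
            buggyGetter(dp);
        }
        System.out.println(dp.size()); // prints 2048 (2 * 2^10)

        List<Integer> source = new ArrayList<>(Arrays.asList(3, 1));
        for (int call = 1; call <= 10; call++) {
            copyingGetter(source);
        }
        System.out.println(source.size()); // prints 2
    }
}
```

With exponential growth like this, a two-element array would pass two billion elements after about 30 calls, which would be consistent with the error being reproducible regardless of heap size.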
