[ 
https://issues.apache.org/jira/browse/PDFBOX-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648352#comment-13648352
 ] 

James Green commented on PDFBOX-1586:
-------------------------------------

We aren't complaining that the getter itself is faulty. The problem is that 
there is more than one reference to the scratchFile. The COSDocument assumes it 
has exclusivity over that scratchFile and will close it.

Sometimes a single scratchFile is used by more than one object. When the 
COSDocument that created the scratchFile falls out of scope and gets GC'd, its 
finalize() calls close() on the scratchFile, but another object may still hold 
a reference to it. If the COSDocument kept its scratchFile private (by removing 
the getter), this could no longer happen and closing during finalize() would be 
safe.

The stack trace we are seeing results from the second reference to the 
scratchFile being used after the owning COSDocument instance has been GC'd. 
When the crash occurs, the getter is not being called on the COSDocument that 
originally created the scratchFile, because that instance has already been 
GC'd - potentially long ago!

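COSDocument doc1 = new COSDocument();
RandomAccess sF1 = doc1.getScratchFile(); // second reference to the scratchFile
doc1 = null;                              // the owner is now unreachable
System.gc();                              // finalize() may run and close the scratchFile
sF1.seek(...);                            // <-- boom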
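
To make the failure easier to reason about without depending on GC timing, here is a minimal, self-contained sketch of the same pattern. ScratchBuffer and OwningDocument are hypothetical stand-ins, not PDFBox classes. The sketch assumes (inferred from the stack trace, not checked against the PDFBox source) that closing the buffer empties its backing list, and it calls close() directly where the real code would rely on finalize().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for RandomAccessBuffer: chunked storage held in a list.
class ScratchBuffer {
    private static final int CHUNK_SIZE = 1024;
    private final List<byte[]> chunks = new ArrayList<>();
    private byte[] current;

    ScratchBuffer(int chunkCount) {
        for (int i = 0; i < chunkCount; i++) {
            chunks.add(new byte[CHUNK_SIZE]);
        }
        current = chunks.get(0);
    }

    // Selects the chunk that holds the given position.
    void seek(long position) {
        current = chunks.get((int) (position / CHUNK_SIZE));
    }

    // Assumed behaviour: closing drops every chunk.
    void close() {
        chunks.clear();
    }
}

// Hypothetical stand-in for COSDocument: creates the buffer, exposes it, closes it.
class OwningDocument {
    private final ScratchBuffer scratch = new ScratchBuffer(32);

    ScratchBuffer getScratchFile() {
        return scratch;
    }

    // In the scenario above this is reached via finalize(); here it is called directly.
    void close() {
        scratch.close();
    }
}

class SharedScratchDemo {
    // Returns the exception message seen by the second holder, or null if none was thrown.
    static String useAfterOwnerClosed() {
        OwningDocument doc = new OwningDocument();
        ScratchBuffer shared = doc.getScratchFile(); // second reference
        doc.close();                                 // owner closes what it believes is exclusive
        try {
            shared.seek(28L * 1024);                 // second holder keeps working
            return null;
        } catch (IndexOutOfBoundsException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(useAfterOwnerClosed());
    }
}
```

The second holder fails inside the list lookup with an IndexOutOfBoundsException, which has the same shape as the trace in this issue; the exact message text depends on the JDK version.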

Note that our code does not reach into the COSDocument for the scratch file - 
the PDFBox library code does; see http://imagebin.org/256280 for what we mean.

It's a bit like a factory that closes the instances it created when the 
factory itself is closed, even though those instances are still being used 
elsewhere.

Why not just add a finalize() on the RandomAccessBuffer that closes the 
resource?

If the architecture states that the scratchFile should indeed be exclusive to 
its creating instance, get rid of the getter; otherwise you cannot be assured 
of that fact.
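
As a rough illustration of what exclusive ownership without a getter could look like, here is a small sketch. ExclusiveDocument and its methods are hypothetical names for illustration only and are not a proposal for the actual PDFBox API. The owner never hands out its scratch storage, only copies of the data, so closing it cannot invalidate anyone else's reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical owner that keeps its scratch storage strictly private.
class ExclusiveDocument implements AutoCloseable {
    private final List<byte[]> scratch = new ArrayList<>();
    private boolean closed;

    // Stores a defensive copy, so callers keep no alias into the scratch storage.
    void append(byte[] chunk) {
        ensureOpen();
        scratch.add(chunk.clone());
    }

    // Returns a copy of the requested chunk, never the stored array itself.
    byte[] chunkAt(int index) {
        ensureOpen();
        return scratch.get(index).clone();
    }

    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("document already closed");
        }
    }

    // Safe to call at any time: nothing outside this class references the storage.
    @Override
    public void close() {
        closed = true;
        scratch.clear();
    }
}

class ExclusiveOwnershipDemo {
    public static void main(String[] args) {
        byte[] copy;
        try (ExclusiveDocument doc = new ExclusiveDocument()) {
            doc.append(new byte[] {1, 2, 3});
            copy = doc.chunkAt(0);
        }
        // The copy remains valid after the owner has been closed.
        System.out.println(copy.length);
    }
}
```

With this shape, use after close fails on the owner with an explicit IllegalStateException rather than surfacing later as an IndexOutOfBoundsException in an unrelated object.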
                
> IndexOutOfBoundsException when saving a document (at random)
> ------------------------------------------------------------
>
>                 Key: PDFBOX-1586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1586
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.1
>            Reporter: James Green
>            Priority: Critical
>
> Getting the following stacktrace:
> org.apache.pdfbox.exceptions.COSVisitorException: java.lang.IndexOutOfBoundsException: Index: 28, Size: 0
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1245)
>     at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:201)
>     at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:206)
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:524)
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:434)
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1056)
>     at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:496)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1392)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1157)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1138)
> ...
> Caused by: java.lang.IndexOutOfBoundsException: Index: 28, Size: 0
>     at java.util.ArrayList.rangeCheck(ArrayList.java:604)
>     at java.util.ArrayList.get(ArrayList.java:382)
>     at org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:84)
>     at org.apache.pdfbox.io.RandomAccessFileInputStream.read(RandomAccessFileInputStream.java:96)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1232)
>
> I'll add some context. We have a "data pipeline" in which a Windows Print 
> Monitor sends PostScript into a servlet which then uses Ghostscript 9.05 to 
> convert in-memory to PDF. This PDF is then loaded into PDFBox using 
> PDDocument.load().
>
> At this point we split the original PDF into multiple smaller ones, each of 
> which is saved to a ByteArrayOutputStream. At the point of save() we are 
> having serious reliability issues.
>
> We saved an original PDF from Ghostscript into a unit test to replicate the 
> problem, without success. If we attempt to re-execute the pipeline to take 
> the original PDF and split it, we get apparently random percentages of saved 
> documents.
>
> For instance, on a 990 page document (text, no images), to be split into 990 
> 1-page documents using Tomcat 7 with -Xmx512m:
> Pass 1: 50% were saved, 50% ended with stack traces
> Pass 2: 100% were saved
> Pass 3: 100% were saved
> The same test with -Xmx128m ended several times with just 1 document saved; 
> the rest were stack traces.
>
> We have also seen this randomly hit a sample document consisting of four 
> pages to be split into two two-page documents, so it does not appear to be 
> memory related. We also added code to catch the IndexOutOfBoundsException and 
> make up to ten attempts to repeat, but it seems the save() either works the 
> first time or not at all.
>
> We're thinking there are environmental factors here but we're now focused on 
> getting this nailed. Any advice or assistance will be welcomed.
