[ 
https://issues.apache.org/jira/browse/PDFBOX-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649156#comment-13649156
 ] 

James Green commented on PDFBOX-1586:
-------------------------------------

I looked at this today. I now understand how we experience this problem.

We have a class, Divider, which has a number of methods. We'll start at method 
a() which accepts a ByteArrayInputStream that holds a PDF. a() calls splitter() 
handing it this ByteArrayInputStream and expecting a list of PDDocuments back.

class Divider {

    public List<byte[]> a(ByteArrayInputStream pdfBytes) {
        List<PDDocument> split = splitter(pdfBytes);
        List<byte[]> retVal = new ArrayList<byte[]>();
        for (PDDocument p : split) {
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            p.save(os);
            retVal.add(os.toByteArray());
        }
        return retVal;
    }

    public List<PDDocument> splitter(ByteArrayInputStream masterPdf) {
        List<PDDocument> retVal = new ArrayList<PDDocument>();
        PDDocument doc = PDDocument.load(masterPdf);
        List<PDPage> pages = doc.getDocumentCatalog().getAllPages();

        // Iterate over the pages and import each page into new PDDocuments
        // added to retVal
        return retVal;
    }
}

Because splitter() internally creates new PDDocuments and calls importPage()
on them with the individual pages from the master document, the master
document falls out of scope the moment splitter() returns. Not unreasonable.

The trouble is that importPage passes the new PDPage a reference to the
original's scratchFile. So once splitter() has returned to a() and the GC
clears up the master document, the scratchFile is closed, causing the save()
calls in a() to crash. All blindingly obvious once you realise that only the
reference to the scratchFile is copied inside the importPage routine, which
most people would expect to perform a clean clone.
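
To illustrate the mechanism, here is a toy model in plain Java. It is not
PDFBox code: ToyScratch, ToyPage and ToyDocument are made-up stand-ins, and
the explicit master.close() is an assumption standing in for whatever the GC
triggers on the real master document. It only shows why a page that shares
the master's buffer breaks once the master is closed, while a page whose
content was copied does not.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a scratch buffer: chunks live in a list that close() empties.
final class ToyScratch {
    private final List<byte[]> chunks = new ArrayList<byte[]>();

    int append(byte[] data) {
        chunks.add(data.clone());
        return chunks.size() - 1;
    }

    byte[] read(int index) {
        // Throws IndexOutOfBoundsException once close() has emptied the list.
        return chunks.get(index);
    }

    void close() {
        chunks.clear();
    }
}

// Toy page: remembers which scratch buffer holds its content, and where.
final class ToyPage {
    final ToyScratch scratch;
    final int index;

    ToyPage(ToyScratch scratch, int index) {
        this.scratch = scratch;
        this.index = index;
    }

    byte[] content() {
        return scratch.read(index);
    }
}

// Toy document that owns exactly one scratch buffer.
final class ToyDocument {
    final ToyScratch scratch = new ToyScratch();
    final List<ToyPage> pages = new ArrayList<ToyPage>();

    ToyPage addPage(byte[] data) {
        ToyPage page = new ToyPage(scratch, scratch.append(data));
        pages.add(page);
        return page;
    }

    // Shallow import: the new page still reads from the source's scratch buffer.
    ToyPage importShared(ToyPage source) {
        ToyPage page = new ToyPage(source.scratch, source.index);
        pages.add(page);
        return page;
    }

    // Deep import: the content is copied into this document's own buffer.
    ToyPage importCopied(ToyPage source) {
        return addPage(source.content());
    }

    void close() {
        scratch.close();
    }
}

final class SharedScratchDemo {
    public static void main(String[] args) {
        ToyDocument master = new ToyDocument();
        ToyPage original = master.addPage(new byte[] {1, 2, 3});

        ToyPage shared = new ToyDocument().importShared(original);
        ToyPage copied = new ToyDocument().importCopied(original);

        // Assumption: models the master being cleaned up after splitter() returns.
        master.close();

        System.out.println("copied page bytes: " + copied.content().length);
        try {
            shared.content();
            System.out.println("shared page still readable");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("shared page failed: " + e.getClass().getSimpleName());
        }
    }
}
```

In the toy, the failure is deterministic because close() is called
explicitly; in our pipeline the timing depends on when the GC runs, which
would fit the random behaviour we saw.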

That hopefully concludes things. We can of course re-work our code to avoid the 
bug but it would be sensible to make importPage perform a proper clone at some 
point.
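
One way we could re-work our code, sketched in plain Java rather than
against PDFBox: have splitter() return a holder that owns the master
alongside the split parts, and release the master only after every part has
been saved. SplitResult is a hypothetical name, the master is modelled as a
plain java.io.Closeable, and the fake master and string parts in main() are
placeholders for the real documents.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder returned by a splitter: owns the master, exposes the parts.
final class SplitResult<P> implements Closeable {
    private final Closeable master;
    private final List<P> parts;

    SplitResult(Closeable master, List<P> parts) {
        this.master = master;
        this.parts = new ArrayList<P>(parts);
    }

    List<P> parts() {
        return parts;
    }

    // The master is released only when the caller closes the result.
    @Override
    public void close() throws IOException {
        master.close();
    }
}

final class SplitResultDemo {
    public static void main(String[] args) throws IOException {
        final List<String> log = new ArrayList<String>();

        // Placeholder for the master document.
        Closeable fakeMaster = new Closeable() {
            @Override
            public void close() {
                log.add("master closed");
            }
        };

        // Placeholders for the split documents.
        List<String> fakeParts = new ArrayList<String>();
        fakeParts.add("part-1");
        fakeParts.add("part-2");

        SplitResult<String> result = new SplitResult<String>(fakeMaster, fakeParts);
        try {
            for (String part : result.parts()) {
                log.add("saved " + part); // stands in for p.save(os) in a()
            }
        } finally {
            result.close(); // the master is still referenced up to this point
        }
        System.out.println(log);
    }
}
```

Because the holder is still used after the loop, the master stays reachable
for the whole time the parts are being saved, so the GC has no opportunity to
clean it up early.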

                
> IndexOutOfBoundsException when saving a document (at random)
> ------------------------------------------------------------
>
>                 Key: PDFBOX-1586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1586
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.1
>            Reporter: James Green
>            Priority: Critical
>
> Getting the following stacktrace:
> org.apache.pdfbox.exceptions.COSVisitorException: java.lang.IndexOutOfBoundsException: Index: 28, Size: 0
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1245)
>     at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:201)
>     at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:206)
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:524)
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:434)
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1056)
>     at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:496)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1392)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1157)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1138)
> ...
> Caused by: java.lang.IndexOutOfBoundsException: Index: 28, Size: 0
>     at java.util.ArrayList.rangeCheck(ArrayList.java:604)
>     at java.util.ArrayList.get(ArrayList.java:382)
>     at org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:84)
>     at org.apache.pdfbox.io.RandomAccessFileInputStream.read(RandomAccessFileInputStream.java:96)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1232)
>
> I'll add some context. We have a "data pipeline" in which a Windows Print
> Monitor sends PostScript to a servlet, which then uses Ghostscript 9.05 to
> convert it to PDF in memory. This PDF is then loaded into PDFBox using
> PDDocument.load().
>
> At this point we split the original PDF into multiple smaller ones, each of
> which is saved to a ByteArrayOutputStream. At the point of save() we are
> having serious reliability issues.
>
> We saved an original PDF from Ghostscript into a unit test to replicate the
> problem, without success. If we re-execute the pipeline to take the original
> PDF and split it, an apparently random percentage of the documents is saved.
>
> For instance, on a 990-page document (text, no images), to be split into 990
> 1-page documents using Tomcat 7 with -Xmx512m:
> Pass 1: 50% were saved, 50% ended with stack traces
> Pass 2: 100% were saved
> Pass 3: 100% were saved
>
> The same test with -Xmx128m ended several times with just 1 document saved;
> the rest were stack traces.
>
> We have also seen this randomly hit a sample document consisting of four
> pages to be split into two two-page documents, so it does not appear to be
> memory related. We also added code to catch the IndexOutOfBoundsException and
> retry up to ten times, but it seems save() either works the first time or
> not at all.
>
> We think there are environmental factors here, but we're now focused on
> getting this nailed. Any advice or assistance will be welcomed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
