[jira] [Commented] (PDFBOX-2920) IndexOutOfBounds Exception when loading large PDF

Maruan Sahyoun (JIRA) Sun, 09 Aug 2015 03:10:00 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679077#comment-14679077
 ]


Maruan Sahyoun commented on PDFBOX-2920:
----------------------------------------

We should support PDFs larger than 2 GB, as the PDF format itself does not 
impose that limit (the spec does impose some other limits).

From a quick look at the code, it should still work without the test, as the 
RandomAccessBuffer works with multiple chunks (each limited to 
Integer.MAX_VALUE) and the pointers are already longs. There are some other 
areas where int has to be replaced by long, such as {{available()}} and 
{{rewind()}}.
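
To illustrate the point about chunks and long pointers, here is a minimal sketch of a chunked in-memory buffer. It is not the PDFBox RandomAccessBuffer code: the class name ChunkedBuffer and all of its methods are hypothetical and only show the general idea. The position is a long, each chunk is an ordinary byte[] (so limited to int sizes), missing chunks are allocated before they are indexed, and available() returns a long so it cannot overflow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not the actual PDFBox implementation.
final class ChunkedBuffer {
    private final int chunkSize;                           // one chunk is a byte[], so its size is an int
    private final List<byte[]> chunks = new ArrayList<>();
    private long pointer = 0;                              // current position, may exceed Integer.MAX_VALUE
    private long size = 0;                                 // total number of bytes written so far

    ChunkedBuffer(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Moves the position; no chunk is touched here, so seeking past the end is harmless.
    void seek(long position) {
        if (position < 0) {
            throw new IllegalArgumentException("negative position");
        }
        pointer = position;
    }

    // Writes one byte at the current position, allocating chunks on demand.
    void write(int b) {
        int chunkIndex = Math.toIntExact(pointer / chunkSize);
        int offset = (int) (pointer % chunkSize);
        // Grow the chunk list before indexing it, so get() never sees index == size().
        while (chunkIndex >= chunks.size()) {
            chunks.add(new byte[chunkSize]);
        }
        chunks.get(chunkIndex)[offset] = (byte) b;
        pointer++;
        if (pointer > size) {
            size = pointer;
        }
    }

    // Reads one byte at the current position, or returns -1 at or beyond the end.
    int read() {
        if (pointer >= size) {
            return -1;
        }
        int chunkIndex = Math.toIntExact(pointer / chunkSize);
        int offset = (int) (pointer % chunkSize);
        pointer++;
        return chunks.get(chunkIndex)[offset] & 0xFF;
    }

    // Remaining bytes as a long; an int return type would overflow past 2 GB.
    long available() {
        return Math.max(0L, size - pointer);
    }

    long length() {
        return size;
    }
}
```

With a chunk size of 4, calling seek(10) followed by write(42) allocates three chunks and leaves length() at 11; seeking back to 10 and calling read() returns 42. The real classes would also need the same int to long change in every caller that stores these values.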

> IndexOutOfBounds Exception when loading large PDF
> -------------------------------------------------
>
>                 Key: PDFBOX-2920
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2920
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.8, 1.8.9, 1.8.10
>         Environment: Software
>            Reporter: Brad Baker
>              Labels: parser
>
> I'm getting exceptions loading large pdfs (~6-8 GB each). I've tried using 
> PDDocument.load() and PDDocument.loadNonSeq(). I can't attach a PDF due to 
> the file size limit of 10 Mb. If there is another way to get it to someone, I 
> can work that out. Here is my code:
>       
>       public static void main(String[] args) {
>               
>               LOGGER.info("Test Large PDF Load " + TEST_PDF);
>               try {
>                       LOGGER.info("Create Steam");
>                       InputStream is = new FileInputStream(TEST_PDF);
>                       LOGGER.info("Start Load");
>                       PDDocument doc = PDDocument.load(is);
> //                    PDDocument doc = PDDocument.loadNonSeq(is, null);
>                       LOGGER.info("Finished Load");
>                       doc.close();
>                       is.close();
>               } catch (IOException e) {
>                       e.printStackTrace();
>               }
>       }
> This first error is using PDDocument.load()
> Aug 06, 2015 1:31:14 PM hp.pdfbox.test.Main main
> INFO: Test Large PDF Load D:\workspace_trunk_luna\test_pdfbox\pdfs\ELOISA 
> ARTOLA CD17433_Indigo.pdf
> Aug 06, 2015 1:31:14 PM hp.pdfbox.test.Main main
> INFO: Create Steam
> Aug 06, 2015 1:32:44 PM hp.pdfbox.test.Main main
> INFO: Start Load
> org.apache.pdfbox.exceptions.WrappedIOException
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:278)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
> at hp.pdfbox.test.Main.main(Main.java:22)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 1041, Size: 1041
> at java.util.ArrayList.rangeCheck(Unknown Source)
> at java.util.ArrayList.get(Unknown Source)
> at org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:110)
> at 
> org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:106)
> at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
> at java.io.BufferedOutputStream.flush(Unknown Source)
> at java.io.FilterOutputStream.close(Unknown Source)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:616)
> at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:650)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
> ... 3 more
> This error was using PDDocument.loadNonSeq()
> INFO: Create Steam
> Aug 06, 2015 1:51:47 PM hp.pdfbox.test.Main main
> INFO: Start Load
> Aug 06, 2015 1:53:39 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver 
> setStartxref
> WARNING: Did not found XRef object at specified startxref position 8552119825
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 509, 
> Size: 509
> at java.util.ArrayList.rangeCheck(Unknown Source)
> at java.util.ArrayList.get(Unknown Source)
> at org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:110)
> at 
> org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:106)
> at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
> at java.io.BufferedOutputStream.flush(Unknown Source)
> at java.io.FilterOutputStream.close(Unknown Source)
> at 
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseCOSStream(NonSequentialPDFParser.java:1847)
> at 
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1448)
> at 
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1374)
> at 
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:1348)
> at 
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:429)
> at 
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:915)
> at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1305)
> at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1288)
> at hp.pdfbox.test.Main.main(Main.java:22) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

