[jira] Commented: (PDFBOX-202) Error on text extraction: java.lang.IndexOutOfBoundsExceptio

Adam Nichols (JIRA) Mon, 27 Dec 2010 16:29:08 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975395#action_12975395
 ]


Adam Nichols commented on PDFBOX-202:
-------------------------------------

First, I tested ExtractText.main(new String[] 
{"C:\\Temp\\PDFBOX-202\\mozambique.pdf"}); and it did not throw any exceptions 
with the current HEAD (which includes two patches I made today to guard 
against NPEs).  So this is fixed in the current HEAD.

No text is extracted to the txt file, but since Adobe Acrobat Standard 8 does 
no better with this file, this is expected.  It's a corrupt PDF, so there's 
not much we can do with it, but it's good that it no longer throws an 
exception.
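
For illustration only (this is not the actual patch), here is a minimal sketch 
of the kind of guard that avoids this class of exception.  It assumes the 
operator handler receives its operands as a List and that the failing operator 
is cm, which takes six numbers; the OperandGuard class and its method are 
hypothetical and not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: checks operand count before indexed access. */
final class OperandGuard {
    /** Assumption: the cm operator takes six numeric operands (a b c d e f). */
    static final int CM_OPERAND_COUNT = 6;

    /**
     * Returns the six matrix values, or null if the operand list is too short,
     * as can happen with a corrupt content stream.
     */
    static double[] readMatrixOperands(List<? extends Number> operands) {
        if (operands == null || operands.size() < CM_OPERAND_COUNT) {
            return null; // caller can skip the operator instead of throwing
        }
        double[] matrix = new double[CM_OPERAND_COUNT];
        for (int i = 0; i < CM_OPERAND_COUNT; i++) {
            matrix[i] = operands.get(i).doubleValue();
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Four operands, matching the reported "Index: 4, Size: 4".
        List<Double> corrupt = Arrays.asList(1.0, 0.0, 0.0, 1.0);
        System.out.println("corrupt operands skipped: " + (readMatrixOperands(corrupt) == null));

        List<Double> valid = Arrays.asList(1.0, 0.0, 0.0, 1.0, 10.0, 20.0);
        System.out.println("valid operand count: " + readMatrixOperands(valid).length);
    }
}
```

Returning null rather than throwing lets the caller ignore a malformed 
operator and keep processing the rest of the page, which matches the behaviour 
described above: no exception, even if little or no text comes out.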

> Error on text extraction: java.lang.IndexOutOfBoundsExceptio
> ------------------------------------------------------------
>
>                 Key: PDFBOX-202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: mozambique.pdf
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1565617
> Originally submitted by gagravarr on 2006-09-26 03:30.
> I'm trying to extract text from a pdf file
> (http://www.cifor.cgiar.org/mla/download/publication/mozambique.pdf),
> but I'm getting an IndexOutOfBoundsException on it:
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 4, Size: 4
>         at java.util.ArrayList.RangeCheck(ArrayList.java:546)
>         at java.util.ArrayList.get(ArrayList.java:321)
>         at org.pdfbox.util.operator.Concatenate.process(Concatenate.java:69)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:494)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:207)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
>         at org.pdfbox.ExtractText.main(ExtractText.java:237)
> I've tried with 0.7.2, and 0.7.3-dev-20060920, and I
> get the same exception from both versions.
> Nick

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
