[iText-questions] Problem using PdfTextExtractor.getTextFromPage(), ArrayIndexOutOfBoundsException thrown by CMapAwareDocumentFont.decodeSingleCID

Sophia Cheng Wed, 09 Sep 2009 07:02:35 -0700

(Apologies for the duplicate posting, but I don't think my mail yesterday
was formatted correctly when I looked at SourceForge...it didn't show the
text below)


Thanks Paulo for answering my earlier question about accessing a secured
pdf.  I realize now that the error I was getting when trying to read the pdf
document with the PdfReader was NOT related to the fact that it was a
secured document.  I ran into this issue as well with another article that
was not secured.  Here is the exception I am getting:

java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.skcheng.pdfExtraction.ExtractDOI.extract(ExtractDOI.java:64)

Here is the code in question:
// Needs java.util.regex.Matcher, java.util.regex.Pattern and
// com.lowagie.text.pdf.parser.PdfTextExtractor.
public String extract() throws Exception {
    String doi = "";
    boolean found = false;

    try {
        // set up text extractor
        PdfTextExtractor extract = new PdfTextExtractor(reader);
        // compile the regex
        Pattern p = Pattern.compile(regExDoi);
        // get number of pages
        int numPages = reader.getNumberOfPages();
        System.out.println("Number of pages: " + numPages);

        // loop through pages until a DOI is found
        for (int page = 1; page <= numPages && !found; page++) {
            System.out.println(page);
            // get text from the page
            String text = extract.getTextFromPage(page);

            // check each page for regExDoi
            Matcher m = p.matcher(text);
            if (m.find()) {
                String foundIt = m.group();
                // split at regExDoiSplit, will be String[] = {"", "the doi numbers"}
                doi = foundIt.split(regExDoiSplit)[1];
                found = true;
            }
        }
    } finally {
        reader.close();
    }

    if (found) {
        return doi;
    } else {
        throw new Exception("Doi not found in file");
    }
}

Where reader is initialized to the pdf in the constructor.  Attached is a
file that is giving me this error.
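
As a side note on the matching step in extract(): the split at regExDoiSplit
could be replaced by a capturing group, so the identifier comes straight out
of the Matcher. The sketch below only illustrates that idea. The pattern, the
class name DoiCapture and the method findDoi are assumptions made up for the
example, since the actual regExDoi and regExDoiSplit are not shown in this
post.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoiCapture {
    // Hypothetical pattern (NOT the regExDoi from the post): a "doi" label,
    // an optional colon and whitespace, then the identifier in group 1.
    private static final Pattern DOI =
            Pattern.compile("(?i)doi:?\\s*(10\\.\\d{4,9}/\\S+)");

    // Returns the captured identifier, or an empty Optional if the page
    // text contains nothing that matches the pattern.
    static Optional<String> findDoi(String pageText) {
        Matcher m = DOI.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findDoi("Published online. doi: 10.1000/xyz123 (2009)"));
    }
}
```

With a group there is no array index to get wrong when the split produces
fewer pieces than expected, and "not found" is an empty Optional rather than
a flag.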

This only occurs with some of the pdfs that I am using, not all of them.
Does anyone know more about why this is being thrown, and/or a possible
workaround?  At the moment, to work around this problem I am using
PdfContentReaderTool.listContentStream, which throws an ExceptionConverter
for the same pdfs that have problems with the above code.  I am currently
ignoring this exception and then manually using regEx to go through the raw
data to extract the information I want, which is getting very messy.
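
One stopgap that might be less messy than parsing the raw content stream is
to catch the exception per page and move on to the next page, in case the
DOI sits on a page that does decode. This is only a sketch of that idea and
does not address the underlying decoding failure. The PageText interface and
the TolerantPageScan class are hypothetical names introduced here so the
example runs without iText; in real use the lambda would wrap
extract.getTextFromPage(page).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TolerantPageScan {
    // Hypothetical abstraction over the per-page text call, e.g.
    //   PageText source = page -> extract.getTextFromPage(page);
    interface PageText {
        String get(int page) throws Exception;
    }

    // Returns the first regex match found on any page that can be read.
    // Pages whose extraction throws are reported and skipped.
    static Optional<String> firstMatch(PageText source, int numPages, Pattern pattern) {
        for (int page = 1; page <= numPages; page++) {
            String text;
            try {
                text = source.get(page);
            } catch (Exception e) {
                // Also covers runtime exceptions such as ArrayIndexOutOfBoundsException
                System.err.println("Skipping page " + page + ": " + e);
                continue;
            }
            Matcher m = pattern.matcher(text);
            if (m.find()) {
                return Optional.of(m.group());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stub source: page 1 fails the way the post describes, page 2 is readable.
        PageText stub = page -> {
            if (page == 1) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
            }
            return "see doi:10.1000/abc here";
        };
        System.out.println(firstMatch(stub, 2, Pattern.compile("doi:\\S+")));
    }
}
```

The obvious limitation is that if the DOI is on the page that fails, this
still finds nothing.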

Thank you again.

Sincerely,
Sophia

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~
Aim for the moon. If you miss, you may hit a star. -W. Clement Stone
