(Apologies for the duplicate posting, but I don't think my mail yesterday
was formatted correctly when I looked at SourceForge...it didn't show the
text below)
Thanks Paulo for answering my earlier question about accessing a secured
pdf. I realize now that the error I was getting when trying to read the pdf
document with the PdfReader was NOT related to the fact that it was a
secured document. I ran into this issue as well with another article that
was not secured. Here is the exception I am getting:
java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.skcheng.pdfExtraction.ExtractDOI.extract(ExtractDOI.java:64)
Here is the code in question:
public String extract() throws Exception {
    String doi = "";
    boolean found = false;
    try {
        // set up text extractor
        PdfTextExtractor extract = new PdfTextExtractor(reader);
        // compile the regex
        Pattern p = Pattern.compile(regExDoi);
        // get number of pages
        int numPages = reader.getNumberOfPages();
        System.out.println("Number of pages: " + numPages);
        // loop through pages until a DOI is found
        for (int page = 1; page <= numPages && !found; page++) {
            System.out.println(page);
            // get text from the page
            String text = extract.getTextFromPage(page);
            // check each page for regExDoi
            Matcher m = p.matcher(text);
            if (m.find()) {
                String foundIt = m.group();
                // split at regExDoiSplit, will be String[] = {"", "the doi numbers"}
                doi = foundIt.split(regExDoiSplit)[1];
                found = true;
            }
        }
    } finally {
        reader.close();
    }
    if (found) {
        return doi;
    } else {
        throw new Exception("Doi not found in file");
    }
}
Here, reader is initialized with the pdf in the constructor. Attached is a
file that gives me this error.
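
As an aside on the matching step: the find-then-split sequence in the code
could probably be collapsed into a single capturing group. The sketch below
is only an illustration. Its pattern is hypothetical and is not the regExDoi
or regExDoiSplit used above; it assumes the DOI follows a "doi" label in the
page text. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoiMatchSketch {
    // Hypothetical pattern (an assumption, not the regExDoi from the post):
    // a "doi" label, an optional colon and whitespace, then the identifier
    // itself captured as group 1.
    private static final Pattern DOI =
            Pattern.compile("(?i)doi:?\\s*(10\\.\\d{4,9}/\\S+)");

    /** Returns the first DOI found in the text, or null if there is none. */
    static String findDoi(String text) {
        Matcher m = DOI.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findDoi("Published online. DOI: 10.1000/xyz123 (2009)"));
        System.out.println(findDoi("no identifier on this page"));
    }
}
```

Reading group 1 directly avoids indexing into the array returned by split,
which would itself fail if the split produced fewer than two elements.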
The exception only occurs with some of the pdfs I am using, not all of them.
Does anyone know more about why it is being thrown, or of a possible
workaround? At the moment, to work around the problem, I am using
PdfContentReaderTool.listContentStream, which throws an ExceptionConverter
for the same pdfs that have problems with the above code. I am currently
ignoring that exception and then manually using regEx to go through the raw
data to extract the information I want, which is getting very messy.
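
One less messy stopgap might be to catch the exception per page and carry on
with the remaining pages. The sketch below shows only that control flow. To
stay self-contained it uses a hypothetical PageText interface in place of the
real extractor; with iText it would presumably be implemented by calling
extract.getTextFromPage(page). All names here are made up for illustration,
and the approach only helps if the DOI sits on a page that decodes cleanly.

```java
import java.util.ArrayList;
import java.util.List;

class SkipBadPagesSketch {
    /** Hypothetical stand-in for the per-page text extraction call. */
    interface PageText {
        String get(int page) throws Exception;
    }

    /**
     * Collects the text of pages 1..numPages. A page whose extraction throws
     * is recorded in 'failed' and skipped instead of aborting the whole run.
     */
    static List<String> readPages(PageText source, int numPages, List<Integer> failed) {
        List<String> texts = new ArrayList<String>();
        for (int page = 1; page <= numPages; page++) {
            try {
                texts.add(source.get(page));
            } catch (Exception e) {
                // Also covers runtime exceptions such as
                // ArrayIndexOutOfBoundsException.
                failed.add(page);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Fake source for the demo: page 2 fails, the others return text.
        PageText fake = page -> {
            if (page == 2) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
            }
            return "text of page " + page;
        };
        List<Integer> failed = new ArrayList<Integer>();
        List<String> texts = readPages(fake, 3, failed);
        System.out.println(texts);   // text of pages 1 and 3
        System.out.println(failed);  // the skipped page number, 2
    }
}
```

The regex search can then run over whatever text was collected, and the list
of failed pages shows which ones still need the raw content stream fallback.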
Thank you again.
Sincerely,
Sophia
--
~~~~~~~~~~~~~~~~~~~~~~~~~
Aim for the moon. If you miss, you may hit a star. -W. Clement Stone