I'm getting an error when extracting text from a PDF using iText 5.1.3.
Extraction works fine with PDFs that use Roman characters, but fails on
Japanese and Korean documents. I searched the forums and found a similar
report, but not with the same call stack, so I'm posting mine.
I can post the PDF, but I would have to remove any sensitive data from
it first. It would be easier if I could send it to just a few people
instead of posting it publicly.
Call Stack:
Feb 16, 2012 6:56:10 PM de1.TestPdfExtraction main
SEVERE: null
java.lang.ArrayIndexOutOfBoundsException: 38901
    at com.itextpdf.text.pdf.CMapAwareDocumentFont.getWidth(CMapAwareDocumentFont.java:182)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getStringWidth(TextRenderInfo.java:210)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getUnscaledWidth(TextRenderInfo.java:113)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getUnscaledBaselineWithOffset(TextRenderInfo.java:147)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getBaseline(TextRenderInfo.java:122)
    at com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy.renderText(LocationTextExtractionStrategy.java:154)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(PdfContentStreamProcessor.java:303)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$2500(PdfContentStreamProcessor.java:74)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(PdfContentStreamProcessor.java:496)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:246)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:366)
    at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:79)
    at test1.TestPdfExtraction.main(utTest.java:4484)
BUILD SUCCESSFUL (total time: 4 seconds)
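The exception message (38901) looks like a character code used as an array index. As an illustration only, here is a small standalone sketch of that failure mode. It assumes a 256-entry width table indexed by character code; that table size, the class name WidthLookupSketch, and all of its methods are hypothetical and have not been checked against the iText source.

```java
import java.util.Arrays;

public class WidthLookupSketch {
    // Hypothetical table size: one width per single-byte character code.
    public static final int TABLE_SIZE = 256;

    // Builds a table in which every code has the same placeholder width.
    public static int[] buildWidths() {
        int[] widths = new int[TABLE_SIZE];
        Arrays.fill(widths, 500);
        return widths;
    }

    // Unchecked lookup: throws ArrayIndexOutOfBoundsException when the
    // code is outside the table, as a multi-byte CJK code would be.
    public static int uncheckedWidth(int[] widths, int code) {
        return widths[code];
    }

    // Bounds-checked lookup: returns a caller-supplied default width
    // for codes that the table does not cover.
    public static int safeWidth(int[] widths, int code, int defaultWidth) {
        if (code < 0 || code >= widths.length) {
            return defaultWidth;
        }
        return widths[code];
    }

    public static void main(String[] args) {
        int[] widths = buildWidths();
        int code = 38901; // the index reported in the stack trace above
        try {
            uncheckedWidth(widths, code);
        } catch (ArrayIndexOutOfBoundsException ex) {
            System.out.println("unchecked lookup failed for code " + code);
        }
        System.out.println("safe lookup: " + safeWidth(widths, code, 1000));
    }
}
```

If something like this is what happens inside getWidth, it would explain why documents with single-byte Roman text extract cleanly while the Japanese and Korean ones fail.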
Here's my code:
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

// utMisc and utString are non-iText helper classes that are not shown here.
class TestPdfExtraction {
    public static void main(String[] args) throws IOException {
        PdfReader pdfReader = new PdfReader("d:/docs/00012001-0000.pdf");
        try {
            //pdfExtractor = new com.itextpdf.text.pdf.parser.PdfTextExtractor(pdfReader);
            int pdfPageCount = pdfReader.getNumberOfPages();
            PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
            TextExtractionStrategy strategy;
            // Page numbers are 1-based.
            for (int iPageNo = 1; iPageNo <= pdfPageCount; ++iPageNo) {
                String pageText = "";
                try {
                    //text = pdfExtractor.getTextFromPage(iPage);
                    strategy = parser.processContent(iPageNo, new LocationTextExtractionStrategy());
                    pageText = strategy.getResultantText();
                    utMisc.debugOutput("PageText[Page=" + iPageNo + "]=" + pageText);
                } catch (Throwable ex) {
                    // The ArrayIndexOutOfBoundsException shown above is logged here.
                    Logger.getLogger(TestPdfExtraction.class.getName()).log(Level.SEVERE, null, ex);
                    pageText = null; //"Error retrieving text from page #" + iPage + "\n";
                }
                pageText = utString.nzBlank(pageText);
            }
        } finally {
            // Always release the reader, even if extraction fails.
            pdfReader.close();
        }
    }
}