[iText-questions] Error extracting Japanese/Korean text from PDF using iText 5.1.3

Mary Aubaun Thu, 16 Feb 2012 16:21:20 -0800

I'm getting an error when extracting text from a PDF using iText 5.1.3.  It
works fine with PDFs that use Roman characters, but fails on Japanese and
Korean documents.  I searched the forums and found something similar, but
not with the same call stack, so I'm posting it.


I can post the PDF, but I would have to remove any sensitive data from it
first.  It would be easier if I could send it to just a few people instead
of posting it publicly.

Call Stack:
Feb 16, 2012 6:56:10 PM de1.TestPdfExtraction main
SEVERE: null
java.lang.ArrayIndexOutOfBoundsException: 38901
    at com.itextpdf.text.pdf.CMapAwareDocumentFont.getWidth(CMapAwareDocumentFont.java:182)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getStringWidth(TextRenderInfo.java:210)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getUnscaledWidth(TextRenderInfo.java:113)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getUnscaledBaselineWithOffset(TextRenderInfo.java:147)
    at com.itextpdf.text.pdf.parser.TextRenderInfo.getBaseline(TextRenderInfo.java:122)
    at com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy.renderText(LocationTextExtractionStrategy.java:154)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(PdfContentStreamProcessor.java:303)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$2500(PdfContentStreamProcessor.java:74)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(PdfContentStreamProcessor.java:496)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:246)
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:366)
    at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:79)
    at test1.TestPdfExtraction.main(utTest.java:4484)
BUILD SUCCESSFUL (total time: 4 seconds)
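
A note on reading the trace: the exception is raised inside a width lookup,
and the reported index (38901) is far above 255. One possible explanation,
stated here only as an assumption, is that a multi-byte character code is
being used as an index into a table sized for single-byte codes. The sketch
below shows that general failure mode in plain Java. It is not iText's
code: the class name, the table, the width values and the fallback value
are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from iText.
class WidthLookupSketch {

    // Assumed width table covering single-byte character codes 0..255.
    static final int[] WIDTHS = new int[256];

    // Assumed fallback width for codes outside the table.
    static final int DEFAULT_WIDTH = 1000;

    static {
        // Arbitrary placeholder width for every single-byte code.
        Arrays.fill(WIDTHS, 500);
    }

    // Unchecked lookup: throws ArrayIndexOutOfBoundsException when
    // code is negative or >= 256.
    static int uncheckedWidth(int code) {
        return WIDTHS[code];
    }

    // Bounds-checked lookup: returns the fallback width instead of throwing.
    static int checkedWidth(int code) {
        if (code >= 0 && code < WIDTHS.length) {
            return WIDTHS[code];
        }
        return DEFAULT_WIDTH;
    }

    public static void main(String[] args) {
        int code = 38901; // the index reported in the stack trace above
        try {
            uncheckedWidth(code);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed for code " + code);
        }
        System.out.println("checked lookup: " + checkedWidth(code));
    }
}
```

If the real cause is along these lines, the fix would belong in the
library's width lookup rather than in the calling code, so the sketch is
meant as an aid to reading the trace and not as a workaround.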


Here's my code:
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

class TestPdfExtraction {

    public static void main(String args[]) throws IOException {
        PdfReader pdfReader = new PdfReader("d:/docs/00012001-0000.pdf");
        try {
            //pdfExtractor = new com.itextpdf.text.pdf.parser.PdfTextExtractor(pdfReader);
            int pdfPageCount = pdfReader.getNumberOfPages();
            PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
            TextExtractionStrategy strategy;
            // Extract each page separately so one failing page does not stop the rest.
            for (int iPageNo = 1; iPageNo <= pdfPageCount; ++iPageNo) {
                String pageText = "";
                try {
                    //text = pdfExtractor.getTextFromPage(iPage);
                    strategy = parser.processContent(iPageNo, new LocationTextExtractionStrategy());
                    pageText = strategy.getResultantText();
                    // utMisc and utString are helper classes not shown in this post.
                    utMisc.debugOutput("PageText[Page=" + iPageNo + "]=" + pageText);
                } catch (Throwable ex) {
                    Logger.getLogger(TestPdfExtraction.class.getName()).log(Level.SEVERE, null, ex);
                    pageText = null; //"Error retrieving text from page #" + iPage + "\n";
                }
                pageText = utString.nzBlank(pageText);
            }
        } finally {
            pdfReader.close();
        }
    }
}
