Re: [iText-questions] Error extracting Japanese/Korean text from PDF using iText 5.1.3

Paulo Soares Fri, 17 Feb 2012 00:25:27 -0800

That's probably already fixed in SVN. Please post the PDF.

Paulo
  ----- Original Message ----- 
  From: Mary Aubaun 
  To: [email protected] 
  Sent: Friday, February 17, 2012 12:21 AM
  Subject: [iText-questions] Error extracting Japanese/Korean text from PDF using iText 5.1.3



  I'm getting an error when extracting text from a PDF with iText 5.1.3.
  Extraction works fine for PDFs that use Roman characters but fails on
  Japanese and Korean documents.  I searched the forums and found a similar
  report, but not with the same call stack, so I'm posting mine.

  I can post the PDF, but I would have to remove any sensitive data from it
  first.  It would be easier if I could send it to just a few people instead
  of posting it publicly.

  Call Stack:
  Feb 16, 2012 6:56:10 PM de1.TestPdfExtraction main
  SEVERE: null
  java.lang.ArrayIndexOutOfBoundsException: 38901
      at com.itextpdf.text.pdf.CMapAwareDocumentFont.getWidth(CMapAwareDocumentFont.java:182)
      at com.itextpdf.text.pdf.parser.TextRenderInfo.getStringWidth(TextRenderInfo.java:210)
      at com.itextpdf.text.pdf.parser.TextRenderInfo.getUnscaledWidth(TextRenderInfo.java:113)
      at com.itextpdf.text.pdf.parser.TextRenderInfo.getUnscaledBaselineWithOffset(TextRenderInfo.java:147)
      at com.itextpdf.text.pdf.parser.TextRenderInfo.getBaseline(TextRenderInfo.java:122)
      at com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy.renderText(LocationTextExtractionStrategy.java:154)
      at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(PdfContentStreamProcessor.java:303)
      at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$2500(PdfContentStreamProcessor.java:74)
      at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(PdfContentStreamProcessor.java:496)
      at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:246)
      at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:366)
      at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:79)
      at test1.TestPdfExtraction.main(utTest.java:4484)
  BUILD SUCCESSFUL (total time: 4 seconds)


  Here's my code:
  import java.io.IOException;
  import java.util.logging.Level;
  import java.util.logging.Logger;

  import com.itextpdf.text.pdf.PdfReader;
  import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
  import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
  import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

  // utMisc and utString are the poster's own helper classes (not shown).
  class TestPdfExtraction {

      public static void main(String[] args) throws IOException {
          PdfReader pdfReader = new PdfReader("d:/docs/00012001-0000.pdf");
          try {
              //pdfExtractor = new com.itextpdf.text.pdf.parser.PdfTextExtractor(pdfReader);
              int pdfPageCount = pdfReader.getNumberOfPages();
              PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
              TextExtractionStrategy strategy;
              for (int iPageNo = 1; iPageNo <= pdfPageCount; ++iPageNo) {
                  String pageText = "";
                  try {
                      //text = pdfExtractor.getTextFromPage(iPage);
                      strategy = parser.processContent(iPageNo, new LocationTextExtractionStrategy());
                      pageText = strategy.getResultantText();
                      utMisc.debugOutput("PageText[Page=" + iPageNo + "]=" + pageText);
                  } catch (Throwable ex) {
                      // Log the failure and continue with the next page.
                      Logger.getLogger(TestPdfExtraction.class.getName()).log(Level.SEVERE, null, ex);
                      pageText = null; //"Error retrieving text from page #" + iPage + "\n";
                  }
                  pageText = utString.nzBlank(pageText);
              }
          } finally {
              // Always release the reader, even if extraction fails.
              pdfReader.close();
          }
      }
  }
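
  The top frame of the call stack is a width lookup in
  CMapAwareDocumentFont.getWidth, and the failing index 38901 is far larger
  than any single-byte character code. The following is a minimal,
  self-contained sketch of that kind of failure and of a bounds-checked
  alternative. It is not iText's code: the class WidthLookupSketch, both
  methods, the 256-entry table, and the default width of 1000 are all
  assumptions made for illustration.

```java
// Hypothetical illustration only; none of these names come from iText.
class WidthLookupSketch {

    // Unchecked lookup: throws ArrayIndexOutOfBoundsException when the code
    // lies outside the table, e.g. a multi-byte CJK character code.
    static int unsafeWidth(int[] widths, int code) {
        return widths[code];
    }

    // Bounds-checked lookup: falls back to defaultWidth for codes the
    // table does not cover.
    static int safeWidth(int[] widths, int code, int defaultWidth) {
        if (code < 0 || code >= widths.length) {
            return defaultWidth;
        }
        return widths[code];
    }

    public static void main(String[] args) {
        int[] widths = new int[256]; // assumed size: one entry per single-byte code
        widths['A'] = 600;
        int cjkCode = 38901; // the index reported in the call stack

        System.out.println("safe width for 'A': " + safeWidth(widths, 'A', 1000));
        System.out.println("safe width for 38901: " + safeWidth(widths, cjkCode, 1000));
        try {
            unsafeWidth(widths, cjkCode);
        } catch (ArrayIndexOutOfBoundsException ex) {
            System.out.println("unchecked lookup failed: " + ex.getMessage());
        }
    }
}
```

  Whether the real library falls back to a default width or consults a
  separate width table for composite fonts is not something this sketch
  claims to show; it only makes the reported exception easier to picture.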


_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference 
to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: 
http://itextpdf.com/themes/keywords.php
