wwkloo wrote
> I have a PDF with Asian font
> iTextExtract_W.pdf
> <http://itext-general.2136553.n4.nabble.com/file/n4657836/iTextExtract_W.pdf>
>
> When I extract the text from it through TextRenderInfo.GetText() inside
> RenderText() of an implemented ITextExtractionStrategy by calling
> PdfTextExtractor.GetTextFromPage(), it returns an incorrect character for
> character 1 and a correct one for character 2. However, when I extract it
> using Acrobat Reader XI by copy and paste, both characters are extracted
> correctly.
>
> 1 is U+20547
> 2 is U+92DB

You seem to have forgotten to register with the mailing list, so your question was visible only to those who follow the list on Nabble, which is a small minority.

That being said, let's look at your issue. Judging by the method names you use, you seem to be working with iTextSharp in .NET, not with iText in Java. I'm on the Java side, though, so I inspected your file using Java.

In Java, char is a 16-bit type, so one cannot expect text extraction to return that first character as the single value 0x20547; instead, its UTF-16 representation is to be expected, i.e. the surrogate pair 0xD841 0xDD47.

So I applied iText text extraction to your file:

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    PdfReader reader = new PdfReader(TEST_FILE.toString());
    String text = PdfTextExtractor.getTextFromPage(reader, 1);
    // Print every UTF-16 code unit of the extracted text as unpadded hex.
    for (char c : text.toCharArray())
    {
        int i = c; // char is unsigned in Java, so no sign correction is needed
        System.out.print("\\u");
        System.out.print(Integer.toHexString(i));
    }

and retrieved:

    \u31\u20\ud841\udd47\u20\ua\u32\u20\u92db\u20

I.e. "\ud841\udd47" and "\u92db" for your Asian characters. So everything seems ok in Java.

Does the situation differ in .NET?

Regards,
Michael
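
For reference, the relation between U+20547 and the pair 0xD841 0xDD47 can be checked without iText at all. The following is only a minimal sketch using the standard java.lang API; the class name SurrogateDemo and its two helper methods are made up for this illustration and are not part of iText. It builds the two characters from the question, dumps them once per 16-bit code unit (as the loop over toCharArray() does) and once per full code point.

```java
class SurrogateDemo {
    // Hex dump of each 16-bit code unit (a Java char), unpadded.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append("\\u").append(Integer.toHexString(c));
        }
        return sb.toString();
    }

    // Hex dump of each full Unicode code point; a surrogate pair counts once.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
            // Advance by 2 chars for a supplementary code point, else by 1.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters from the question, built from their code points.
        String s = new String(Character.toChars(0x20547)) + (char) 0x92DB;
        System.out.println(codeUnits(s));
        System.out.println(codePoints(s));
    }
}
```

The first line printed shows three code units (d841, dd47, 92db), the second shows two code points (U+20547 and U+92DB). Code that compares extracted text against expected characters should therefore compare code points, not individual char values, whenever characters outside the Basic Multilingual Plane may occur.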