wwkloo wrote
>
> mkl wrote
>> wwkloo,
>> wwkloo wrote
>>> I have a PDF with Asian font
>>> iTextExtract_W.pdf
>>> <http://itext-general.2136553.n4.nabble.com/file/n4657836/iTextExtract_W.pdf>
>>>
>>> When I extract the text from it through TextRenderInfo.GetText() inside
>>> RenderText() of an implemented ITextExtractionStrategy by calling
>>> PdfTextExtractor.GetTextFromPage(), it returns an incorrect character
>>> for character 1 and correct for character 2. However, when I extract it
>>> using Acrobat Reader XI by copy and paste, both characters are extracted
>>> correctly.
>>>
>>> 1 is U+20547
>>> 2 is U+92DB
>> You seem to have forgotten to register with the mailing list. Thus, your
>> question was only visible to those who follow the mailing list on nabble
>> which is a small minority.
>>
>> That being said, let's look at your issue.
>>
>> Considering the method names you use you seem to be working with
>> iTextSharp in .Net, not with iText in Java. I'm on the Java side, though,
>> thus I inspected your file using Java.
> Yes, I am working with iTextSharp in .Net.
> mkl wrote
>> In Java char is a 16bit type; thus, one cannot expect text extraction to
>> return that first character as 0x20547; instead the UTF16 representation
>> might be expected, i.e. 0xD841 0xDD47.
>>
>> Thus, I applied iText text extraction to your file:
>>
>>     PdfReader reader = new PdfReader(TEST_FILE.toString());
>>     String text = PdfTextExtractor.getTextFromPage(reader, 1);
>>     for (char c: text.toCharArray())
>>     {
>>         int i = c<0 ? Integer.MAX_VALUE + c : c;
>>         System.out.print("\\u");
>>         System.out.print(Integer.toHexString(i));
>>     }
>>
>> and retrieved:
>>
>>     \u31\u20\ud841\udd47\u20\ua\u32\u20\u92db\u20
>>
>> I.e. "\ud841\udd47" and "\u92db" for your Asian characters.
>>
>> So everything seems ok in Java. Does the situation differ in .Net?
>>
>> Regards, Michael
> Thanks for the code and for trying it.
> I followed up and tried something similar in .Net.
> With UTF16, the 1st Asian character returned is 0xFFFD. The 2nd is correct.
>
> === C# ===
> PdfReader rdr = new PdfReader(ofdFile.FileName);
> ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
> String txt = PdfTextExtractor.GetTextFromPage(rdr, 1, strategy);
> byte[] bs16 = Encoding.Unicode.GetBytes(txt);
> foreach (byte b in bs16)
> {
>     Console.Write("{0:X2} ", b);
> }
> Console.Write("\n");
> === C# ===
>
> === OUTPUT ===
> 31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00
> === OUTPUT ===
>
> Regards
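
For reference, the byte dump above can be read back as UTF-16 code units. The following is only a minimal Java sketch and is not part of either test program: the class name Utf16Check and its two helper methods are made up for illustration, and it uses nothing but standard JDK calls. It shows that U+20547 should appear as the surrogate pair D841 DD47, whereas the .Net output holds the single code unit FFFD (the Unicode replacement character) at that position.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf16Check {

    // Formats every UTF-16 code unit of s as a 4-digit upper-case hex value.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Decodes a little-endian UTF-16 byte dump (the layout that
    // Encoding.Unicode.GetBytes produces in .Net) into a Java String.
    static String decodeUtf16Le(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Character 1 as it should be extracted: U+20547 as a surrogate pair.
        String expected = new String(Character.toChars(0x20547));
        System.out.println(codeUnits(expected)); // D841 DD47

        // The bytes printed by the C# test, copied from the OUTPUT block.
        String got = decodeUtf16Le(
                0x31, 0x00, 0x20, 0x00, 0xFD, 0xFF, 0x20, 0x00, 0x0A, 0x00,
                0x32, 0x00, 0x20, 0x00, 0xDB, 0x92, 0x20, 0x00);
        System.out.println(codeUnits(got));
        // 0031 0020 FFFD 0020 000A 0032 0020 92DB 0020
    }
}
```

So character 2 (92DB) survives unchanged, while the two code units expected for character 1 have been replaced by a single FFFD in the .Net result.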
Additional information:

When I create the PDF with another program, the text can be extracted correctly by both iText and Acrobat Reader XI:

- 1: 0xD841 0xDD47
- 2: 0x92DB

However, the character is not displayed correctly. :(

iTextExtract_O.pdf
<http://itext-general.2136553.n4.nabble.com/file/n4657858/iTextExtract_O.pdf>

Please help!

Regards
wwkloo

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Differences-btw-text-extraction-from-iText-and-Acrobat-Reader-tp4657836p4657858.html
Sent from the iText - General mailing list archive at Nabble.com.