mkl wrote
> wwkloo,
>
> wwkloo wrote
>> I have a PDF with Asian font
>> iTextExtract_W.pdf
>> <http://itext-general.2136553.n4.nabble.com/file/n4657836/iTextExtract_W.pdf>
>>
>> When I extract the text from it through TextRenderInfo.GetText() inside
>> RenderText() of an implemented ITextExtractionStrategy by calling
>> PdfTextExtractor.GetTextFromPage(), it returns an incorrect character for
>> character 1 and correct for character 2. However, when I extract it using
>> Acrobat Reader XI by copy and paste, both characters are extracted
>> correctly.
>>
>> 1 is U+20547
>> 2 is U+92DB
>
> You seem to have forgotten to register with the mailing list. Thus, your
> question was only visible to those who follow the mailing list on nabble
> which is a small minority.
>
> That being said, let's look at your issue.
>
> Considering the method names you use you seem to be working with
> iTextSharp in .Net, not with iText in Java. I'm on the Java side, though,
> thus I inspected your file using Java.

Yes, I am working with iTextSharp in .Net.

mkl wrote
> In Java char is a 16bit type; thus, one cannot expect text extraction to
> return that first character as 0x20547; instead the UTF16 representation
> might be expected, i.e. 0xD841 0xDD47.
>
> Thus, I applied iText text extraction to your file:
>
>     PdfReader reader = new PdfReader(TEST_FILE.toString());
>     String text = PdfTextExtractor.getTextFromPage(reader, 1);
>     for (char c: text.toCharArray())
>     {
>         int i = c<0 ? Integer.MAX_VALUE + c : c;
>         System.out.print("\\u");
>         System.out.print(Integer.toHexString(i));
>     }
>
> and retrieved:
>
>     \u31\u20\ud841\udd47\u20\ua\u32\u20\u92db\u20
>
> I.e. "\ud841\udd47" and "\u92db" for your Asian characters.
>
> So everything seems ok in Java. Does the situation differ in .Net?
>
> Regards, Michael

Thanks for the code and for trying it. I did the same thing in .Net. Dumping the extracted string as UTF-16 bytes (little-endian), the 1st Asian character comes back as 0xFFFD, the Unicode replacement character. The 2nd is correct.

=== C# ===
PdfReader rdr = new PdfReader(ofdFile.FileName);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
String txt = PdfTextExtractor.GetTextFromPage(rdr, 1, strategy);
byte[] bs16 = Encoding.Unicode.GetBytes(txt);
foreach (byte b in bs16)
{
    Console.Write("{0:X2} ", b);
}
Console.Write("\n");
=== C# ===

=== OUTPUT ===
31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00
=== OUTPUT ===

Regards

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Differences-btw-text-extraction-from-iText-and-Acrobat-Reader-tp4657836p4657853.html
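
For comparison, here is a minimal Java sketch of what the byte dump would look like if the surrogate pair survived extraction. It does not touch iText at all: the class name SurrogateCheck and its helper methods are made up for illustration, and the input string is built by hand from the two code points given above (U+20547 and U+92DB), on the assumption that the page text is laid out as in Michael's output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of iText or iTextSharp.
class SurrogateCheck {

    // Hex dump of the UTF-16 little-endian bytes, same layout as the C# output above.
    static String utf16leHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16LE)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Lists whole code points, so a surrogate pair shows up as one entry.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int cp : s.codePoints().toArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("U+").append(Integer.toHexString(cp).toUpperCase());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed page text: "1 <U+20547> <newline>2 <U+92DB> ".
        String first = new String(Character.toChars(0x20547)); // two chars: D841 DD47
        String expected = "1 " + first + " \n2 " + (char) 0x92DB + " ";

        System.out.println(codePoints(expected));
        // U+31 U+20 U+20547 U+20 U+A U+32 U+20 U+92DB U+20

        System.out.println(utf16leHex(expected));
        // 31 00 20 00 41 D8 47 DD 20 00 0A 00 32 00 20 00 DB 92 20 00
    }
}
```

With the pair intact, the third character takes four bytes (41 D8 47 DD). The .Net output shows a single code unit FD FF in that position, so the string handed to Encoding.Unicode.GetBytes() did not contain the intact surrogate pair.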