wwkloo, you don't seem to have registered with the mailing list; at least I only saw your message on Nabble, not via mail.
That being said...

wwkloo wrote
> I am facing the problem that some PDF are displayed one way and the
> PdfTextExtractor.GetTextFromPage get different list of characters.
>
> The following two PDFs are displayed the same in Acrobat Reader, but the
> extracted output are not.
> - ok.pdf is successful
> - failed.pdf is unsuccessful
>
> ok.pdf <http://itext-general.2136553.n4.nabble.com/file/n4657799/ok.pdf>
> failed.pdf <http://itext-general.2136553.n4.nabble.com/file/n4657799/failed.pdf>

As you mention that the "PDFs are displayed the same in Acrobat Reader," you have surely also tried copying and pasting from that software. If so, you will have seen that the text from ok.pdf is correctly copied as "增補字集" while the text from failed.pdf is copied as "增增增增". So this is obviously not an iText-specific problem but a more general one.

The problem is actually due to the /ToUnicode mapping of the embedded font used in each file. In the case of ok.pdf you have:

    4 beginbfrange
    <0697><0697><5b57>
    <1083><1083><96c6>
    <11d6><11d6><88dc>
    <13fa><13fa><589e>
    endbfrange

Thus, the character identifier 0697 is mapped to 5b57, 1083 to 96c6, 11d6 to 88dc, and 13fa to 589e. These seem to be the correct mappings.

In the case of failed.pdf, on the other hand:

    4 beginbfrange
    <0697><0697><589e>
    <1083><1083><589e>
    <11d6><11d6><589e>
    <13fa><13fa><589e>
    endbfrange

Thus, all four character identifiers 0697, 1083, 11d6, and 13fa are mapped to 589e.

So failed.pdf contains a broken CID-to-Unicode mapping, and therefore text extraction must fail: any extractor that relies on /ToUnicode, be it iText or Acrobat Reader, can only return the character the mapping names.

Regards,
Michael
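
P.S. To make the effect of the two mappings concrete, here is a minimal Python sketch that parses the two bfrange sections quoted above and decodes the same four character codes with each. It is only an illustration, not how iText does it internally. The names `parse_bfrange`, `decode`, and `CODES` are made up for this example, and the order of the codes in `CODES` is an assumption inferred from the text Acrobat Reader copies from ok.pdf; I have not read it from the content streams.

```python
import re

# The two bfrange sections quoted in this message, copied verbatim.
OK_CMAP = """4 beginbfrange
<0697><0697><5b57>
<1083><1083><96c6>
<11d6><11d6><88dc>
<13fa><13fa><589e>
endbfrange"""

FAILED_CMAP = """4 beginbfrange
<0697><0697><589e>
<1083><1083><589e>
<11d6><11d6><589e>
<13fa><13fa><589e>
endbfrange"""

# One bfrange entry: <first code> <last code> <Unicode of first code>
BFRANGE_ENTRY = re.compile(
    r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>"
)


def parse_bfrange(section):
    """Map each character code in the section to a Unicode character.

    Simplified on purpose: only the <lo><hi><dst> form with a single
    BMP code point as destination is handled, not the array form.
    """
    mapping = {}
    for lo, hi, dst in BFRANGE_ENTRY.findall(section):
        start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(base + offset)
    return mapping


def decode(codes, mapping):
    """Translate a sequence of character codes using the given mapping."""
    return "".join(mapping[code] for code in codes)


# Assumption: the order in which the four codes are drawn on the page.
CODES = [0x13FA, 0x11D6, 0x0697, 0x1083]

ok = parse_bfrange(OK_CMAP)
failed = parse_bfrange(FAILED_CMAP)

print(decode(CODES, ok))      # text as mapped by ok.pdf
print(decode(CODES, failed))  # text as mapped by failed.pdf

# A quick sanity check: count how many distinct characters each map yields.
print(len(set(ok.values())), len(set(failed.values())))
```

Under that assumption the first call reproduces the text copied from ok.pdf and the second the repeated character copied from failed.pdf. Note that several codes mapping to the same Unicode value is not invalid in itself, so the distinct-value count is only a hint, but four different glyphs collapsing onto one character is a strong sign that the file was written with a broken mapping.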