wwkloo,

you don't seem to have registered with the mailing list; I at least only saw
your message on nabble, not via mail.

That being said...

wwkloo wrote
> I am facing the problem that some PDF are displayed one way and the
> PdfTextExtractor.GetTextFromPage get different list of characters.
> 
> The following two PDFs are displayed the same in Acrobat Reader, but the
> extracted output are not.
> - ok.pdf is successful
> - failed.pdf is unsuccessful
> ok.pdf <http://itext-general.2136553.n4.nabble.com/file/n4657799/ok.pdf>  
> 
> failed.pdf
> <http://itext-general.2136553.n4.nabble.com/file/n4657799/failed.pdf>  

As you mention that the "PDFs are displayed the same in Acrobat Reader," you
have surely also tried copying and pasting from that software. If so, you
will have seen that the text from ok.pdf is correctly copied as "增補字集"
while the text from failed.pdf is copied as "增增增增". Thus, this is not an
iText-specific problem but a more general one.

The problem is actually due to the /ToUnicode mapping of the embedded font
used in each file.

In the case of ok.pdf you have:

4 beginbfrange
<0697><0697><5b57>
<1083><1083><96c6>
<11d6><11d6><88dc>
<13fa><13fa><589e>
endbfrange

Thus, the character identifier 0697 is mapped to 5b57, 1083 to 96c6, 11d6 to
88dc, and 13fa to 589e. These seem to be the correct mappings.
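
To illustrate how such a mapping is applied during text extraction, here is a
minimal Python sketch. It is not iText's implementation. The helper names
parse_bfrange and decode are hypothetical, the parser only handles the simple
<lo><hi><dst> form with a single BMP code point as destination, and the CID
order 13fa, 11d6, 0697, 1083 is an assumption inferred from the copied text
"增補字集", not read from the content stream.

```python
import re

# The bfrange section quoted above for ok.pdf.
OK_BFRANGE = """\
4 beginbfrange
<0697><0697><5b57>
<1083><1083><96c6>
<11d6><11d6><88dc>
<13fa><13fa><589e>
endbfrange
"""

HEX_TRIPLE = re.compile(
    r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>"
)


def parse_bfrange(cmap_text):
    """Return {cid: character} for simple <lo><hi><dst> bfrange lines.

    Simplification: dst is treated as one BMP code point; array
    destinations and multi-unit UTF-16 destinations are not handled.
    """
    mapping = {}
    for lo, hi, dst in HEX_TRIPLE.findall(cmap_text):
        start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(base + offset)
    return mapping


def decode(cids, mapping):
    """Map each CID to its character; unmapped CIDs become U+FFFD."""
    return "".join(mapping.get(cid, "\ufffd") for cid in cids)


# Assumed CID order (inferred from the copied text, see above).
ASSUMED_CIDS = [0x13FA, 0x11D6, 0x0697, 0x1083]

if __name__ == "__main__":
    print(decode(ASSUMED_CIDS, parse_bfrange(OK_BFRANGE)))
```

With the ok.pdf mapping, each of the four character identifiers resolves to a
different character, which is why copy and paste and text extraction agree
with what is displayed.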

In the case of failed.pdf, on the other hand, you have:

4 beginbfrange
<0697><0697><589e>
<1083><1083><589e>
<11d6><11d6><589e>
<13fa><13fa><589e>
endbfrange

Thus, all four character identifiers 0697, 1083, 11d6, and 13fa are mapped
to 589e.

So failed.pdf contains a broken CID-to-Unicode mapping, and therefore text
extraction is bound to fail.
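
If you want to check other files for this kind of defect, a rough heuristic
is to look for several character identifiers that share one Unicode target.
The following Python sketch assumes the /ToUnicode entries have already been
read into a dictionary (the values below are the ones quoted for failed.pdf);
find_collisions is a hypothetical helper, not an iText API. Note that several
CIDs mapping to the same character is not always an error, since a font may
contain variant glyphs of one character, so treat a hit as a hint only.

```python
from collections import defaultdict

# /ToUnicode entries quoted above for failed.pdf, as {cid: character}.
FAILED_TO_UNICODE = {
    0x0697: "\u589e",
    0x1083: "\u589e",
    0x11D6: "\u589e",
    0x13FA: "\u589e",
}


def find_collisions(to_unicode):
    """Return {character: [cids]} for characters shared by several CIDs."""
    by_target = defaultdict(list)
    for cid, text in to_unicode.items():
        by_target[text].append(cid)
    return {
        text: sorted(cids)
        for text, cids in by_target.items()
        if len(cids) > 1
    }


if __name__ == "__main__":
    for text, cids in find_collisions(FAILED_TO_UNICODE).items():
        print(text, [format(cid, "04x") for cid in cids])
```

For failed.pdf this reports all four character identifiers under the single
character U+589E, which matches the "增增增增" you get when copying the text.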

Regards,   Michael




