Hello,
I have come across a PDF file that does not embed its fonts (not even as
subsets), and as a result text extraction from it fails.
I inspected the corresponding font dictionary and found the following
(indirect references are shown inline as embedded PDF objects in the pseudo
code below, and dictionary entries are written as /Name=VALUE):
<<C2_0
/Type=/Font
/BaseFont=/MingLiU
/Subtype=/CIDFontType2
<<CIDSystemInfo
/Ordering="CNS1"
/Registry="Adobe"
/Supplement=3
>>
<<FontDescriptor
/FontName=/MingLiU
/Lang=/zh-TW
/FontFamily=/MingLiU
% there's no FontFile2 or other embedded font resource in the FontDescriptor
% dictionary
>>
>>
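The "no embedded font program" observation above can be sketched in Python as follows. This is only an illustration: the font descriptor is modelled as a plain dict rather than read with any PDF library, and `has_embedded_font` is a hypothetical helper name. The three key names are the font-program entries that the PDF specification defines for a font descriptor.

```python
# Keys under which a FontDescriptor may carry an embedded font program.
EMBEDDED_FONT_KEYS = ("FontFile", "FontFile2", "FontFile3")


def has_embedded_font(font_descriptor: dict) -> bool:
    """Return True if the descriptor carries any embedded font program."""
    return any(key in font_descriptor for key in EMBEDDED_FONT_KEYS)


# The descriptor from the dictionary above, modelled as a plain dict
# (an assumption for illustration; a real PDF library exposes its own types).
mingliu_descriptor = {
    "FontName": "MingLiU",
    "Lang": "zh-TW",
    "FontFamily": "MingLiU",
}

print(has_embedded_font(mingliu_descriptor))
```

For the descriptor shown above the check reports that nothing is embedded, so a reader of the file has to rely on the named font and the CIDSystemInfo entries instead.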
How can text be extracted from this kind of document?
There is an Asian pack on SourceForge that contains a CNS1 file, which may be
useful for this kind of font decoding, but I don't know how to use that pack.
If anyone has done something similar, please give me a hint.
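To frame the question, here is a rough Python sketch of the lookup step I have in mind. It assumes the character codes from the content stream have already been converted to CIDs through the parent font's /Encoding CMap, and that a CID-to-Unicode table for the Adobe-CNS1 collection can be obtained from somewhere (possibly the CNS1 file in the Asian pack, whose format I do not know). The table entries below are made-up placeholders, not real Adobe-CNS1 mappings, and `cids_to_text` is a hypothetical helper.

```python
def cids_to_text(cids, cid_to_unicode):
    """Map each CID to a character; CIDs missing from the table become U+FFFD."""
    return "".join(chr(cid_to_unicode.get(cid, 0xFFFD)) for cid in cids)


# Placeholder table: these pairs are illustrative only, NOT real Adobe-CNS1 data.
placeholder_table = {1: 0x4E00, 2: 0x4E8C}

# CID 999 is deliberately absent to show the fallback to the replacement character.
text = cids_to_text([1, 2, 999], placeholder_table)
print(text)
```

What I am missing is how to obtain the real table and how to plug it into the text extraction.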
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference
to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples:
http://itextpdf.com/themes/keywords.php