Re: [iText-questions] NPE while Extracting text

Leonard Rosenthol Mon, 21 Jun 2010 09:34:01 -0700

There are two ways to handle Type 3 encodings.

1) If it's a newer Type 3 font with an associated ToUnicode table, use that table - that's easy ;).

2) Otherwise, look up the name of the glyph (the key in the CharProcs table) in the Adobe 
Glyph List (<http://en.wikipedia.org/wiki/Adobe_Glyph_List>), which maps 
standard names to Unicode values.
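
A minimal sketch of option 2 in Java, for illustration only. The class name
GlyphNameLookup, the method toUnicode and the tiny hard-coded table are
assumptions made for this example and are not part of iText; a real
implementation would load the full Adobe Glyph List. The sketch also handles
the "uniXXXX" naming convention and drops a ".suffix" such as ".swash", but
it skips the other rules of the glyph naming convention.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an iText class.
class GlyphNameLookup {
    // A very small excerpt of the Adobe Glyph List (name -> code point).
    private static final Map<String, Integer> AGL_SUBSET = new HashMap<String, Integer>();
    static {
        AGL_SUBSET.put("space", 0x0020);
        AGL_SUBSET.put("hyphen", 0x002D);
        AGL_SUBSET.put("period", 0x002E);
        AGL_SUBSET.put("A", 0x0041);
        AGL_SUBSET.put("eacute", 0x00E9);
        AGL_SUBSET.put("Euro", 0x20AC);
        AGL_SUBSET.put("fi", 0xFB01);
    }

    // Returns the Unicode text for a glyph name, or null if it cannot be resolved.
    static String toUnicode(String glyphName) {
        // Drop any suffix after the first period, e.g. "A.swash" -> "A".
        int dot = glyphName.indexOf('.');
        String base = dot >= 0 ? glyphName.substring(0, dot) : glyphName;

        // 1) Direct lookup in the glyph list.
        Integer codePoint = AGL_SUBSET.get(base);
        if (codePoint != null) {
            return new String(Character.toChars(codePoint));
        }

        // 2) "uni" followed by one or more groups of four uppercase hex digits.
        if (base.matches("uni([0-9A-F]{4})+")) {
            StringBuilder sb = new StringBuilder();
            for (int i = 3; i < base.length(); i += 4) {
                sb.append((char) Integer.parseInt(base.substring(i, i + 4), 16));
            }
            return sb.toString();
        }

        // Unknown name (for example "g123" or ".notdef"): no mapping available.
        return null;
    }
}
```

Type 3 fonts often use arbitrary glyph names (such as "g123"), in which case
this lookup returns null and the text cannot be recovered from the names alone.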

Leonard

-----Original Message-----
From: Kevin Day [mailto:ke...@trumpetinc.com] 
Sent: Monday, June 21, 2010 5:52 PM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] NPE while Extracting text


The trick here is obtaining a mapping between the type 3 font glyphs and some
sort of encoded text.  There are several ways that this can be done, and
they are fairly well supported by the text parser - but type 3 fonts, as has
been mentioned, don't *usually* have this sort of mapping information.

I know a lot of the PDF specification, but I don't know all of it - and it's
quite possible that there is some mechanism for obtaining this sort of
mapping.  I guess the first thing to do is to ask whether Acrobat can figure
the text out for these fonts (can you highlight the text, then copy and paste
it into a text editor?).  If it can, then it's time to dig into the PDF spec
and figure out if there is some mapping strategy that isn't being handled by
CMapAwareDocumentFont.

What it sounds like to me is that the string that is passed into decode() is
actually correct.  Interestingly, looking at the font definition that you
provide, there is a dictionary entry for Encoding.  I think that this is
where careful reading of the PDF spec is going to be required - so here are
some resources to get you started:

Here's the spec:  http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf

Section 9.6.5 discusses type 3 font dictionaries.


I note that Type 3 fonts *can* have a ToUnicode entry.  And they have an
Encoding entry.  So these sure sound an awful lot like Type 1 fonts as far
as text extraction is concerned.  From a debugging perspective, I think that
the next step is to do a debug walkthrough with a document containing a
normal Type 1 font, and compare that with the walkthrough of your document
with the Type 3 font.  You may find that there's something subtle that can be
tweaked to make this work.
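
As a rough illustration of how the ToUnicode and Encoding entries could be
combined, here is a sketch in Java. It does not use the iText API: the class
Type3DecodeSketch, the method decode and the plain maps standing in for the
parsed ToUnicode CMap, the Differences array of the Encoding dictionary and
a glyph-name table are all assumptions made for this example.

```java
import java.util.Map;

// Hypothetical sketch, not part of iText.
class Type3DecodeSketch {
    /**
     * Decodes single-byte character codes to text.
     *
     * @param codes           raw character codes from a content-stream string
     * @param toUnicode       code -> text from a ToUnicode CMap, or null if absent
     * @param codeToGlyphName code -> glyph name, e.g. from Encoding /Differences
     * @param glyphNameToText glyph name -> text, e.g. from the Adobe Glyph List
     */
    static String decode(byte[] codes,
                         Map<Integer, String> toUnicode,
                         Map<Integer, String> codeToGlyphName,
                         Map<String, String> glyphNameToText) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as unsigned

            // Prefer the ToUnicode mapping when the font provides one.
            String text = (toUnicode != null) ? toUnicode.get(code) : null;

            // Otherwise fall back to the glyph name given by the encoding.
            if (text == null) {
                String name = codeToGlyphName.get(code);
                if (name != null) {
                    text = glyphNameToText.get(name);
                }
            }

            // Emit U+FFFD when no mapping could be found.
            out.append(text != null ? text : "\uFFFD");
        }
        return out.toString();
    }
}
```

Comparing what each lookup returns for a Type 1 font and for the Type 3 font
in a walkthrough like this may show which step yields nothing.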

Please let me know what you find!
-- 
View this message in context: 
http://itext-general.2136553.n4.nabble.com/NPE-while-Extracting-text-tp2256512p2262853.html
Sent from the iText - General mailing list archive at Nabble.com.

------------------------------------------------------------------------------
ThinkGeek and WIRED's GeekDad team up for the Ultimate 
GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the 
lucky parental unit.  See the prize list and enter to win: 
http://p.sf.net/sfu/thinkgeek-promo
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.itextpdf.com/book/
Check the site with examples before you ask questions: 
http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/

