Re: [iText-questions] extracting text from pdfs with japanese data

Kevin Day Wed, 17 Dec 2008 14:37:43 -0800

I'm talking about the DocumentFont.metrics member variable that gets populated by the call to DocumentFont.fillMetrics(), but I see what you are saying. metrics appears to be a map that maps the unicode value to two things: 1. the bytes that will actually be written to the content stream and, 2. the character width.

For purposes of parsing a content stream and extracting text, we actually need a map that maps the bytes in the content stream to two things: 1. The unicode value, and 2. the character width. (This is pretty much a reverse of the metrics map).

I am currently achieving this alternate map in a slightly backwards way. I use a CMap to map from content stream bytes to the unicode value. I then ask DocumentFont for the width of that unicode character. For width determination, I am effectively converting bytes -> unicode -> bytes -> width, instead of just converting bytes->width. Not strictly a flaw, but definitely not efficient. This is an area for future improvement, but not as fundamental to the current problem as I originally thought.
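
A minimal sketch of the "reverse of the metrics map" idea, assuming the metrics table has the shape described above (unicode value -> {character code, width}). The class and method names are hypothetical and not part of iText; the sample values in main() are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of iText: inverts a unicode-keyed metrics
// table into one keyed by the content stream character code.
final class ReverseMetrics {

    /**
     * @param metrics unicode value -> {character code, width} (assumed shape)
     * @return character code -> {unicode value, width}
     */
    static Map<Integer, int[]> invert(Map<Integer, int[]> metrics) {
        Map<Integer, int[]> byCode = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : metrics.entrySet()) {
            int unicode = e.getKey();
            int code = e.getValue()[0];
            int width = e.getValue()[1];
            // If several unicode values share one code, keep the first one seen.
            byCode.putIfAbsent(code, new int[] {unicode, width});
        }
        return byCode;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> metrics = new HashMap<>();
        metrics.put(0x3042, new int[] {0x0844, 1000}); // made-up sample entry
        int[] hit = invert(metrics).get(0x0844);
        System.out.println(Integer.toHexString(hit[0]) + " width=" + hit[1]);
    }
}
```

With a table like this, the width lookup becomes bytes -> width in one step instead of bytes -> unicode -> bytes -> width.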

Coming back to the question at hand...

Does the following algorithm seem to be correct for determining cmap info?

1. Check to see if ToUnicode is specified. If so, use it.

2. Check the subtype and encoding. If the font is Type0 with Identity-H encoding:

a. read CIDSystemInfo, then read its Ordering

b. translate Ordering into an appropriate cmap filename (I know that Japan1 -> UniJIS-UTF16-H - but I'm at a total loss as to how you know that, other than reading all of the CMap files available until you find one that has the right ordering specified??)

c. read the cmap file, etc...

As you can see, I'm getting hung up on step (b). I could certainly just toss a hand coded map at the problem, but it doesn't feel like the right solution to me.
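
For illustration, the steps above with a hand-coded table for step (b) might look like the sketch below. Only the Japan1 -> UniJIS-UTF16-H pair comes from this thread; the other table entries are assumptions based on Adobe's CMap naming, and the class and method names are hypothetical, not iText API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not iText API: picks the source of the
// code -> Unicode mapping for a font.
final class CMapNames {
    private static final Map<String, String> BY_ORDERING = new HashMap<>();
    static {
        BY_ORDERING.put("Japan1", "UniJIS-UTF16-H"); // pair named in this thread
        // Assumed entries, following the same Adobe naming pattern:
        BY_ORDERING.put("GB1", "UniGB-UTF16-H");
        BY_ORDERING.put("CNS1", "UniCNS-UTF16-H");
        BY_ORDERING.put("Korea1", "UniKS-UTF16-H");
    }

    /**
     * Returns "ToUnicode" when the font carries its own map (step 1),
     * a cmap resource name for Type0 + Identity-H fonts (step 2),
     * or null when neither rule applies.
     */
    static String choose(boolean hasToUnicode, String subtype,
                         String encoding, String ordering) {
        if (hasToUnicode) {
            return "ToUnicode";
        }
        if ("Type0".equals(subtype) && "Identity-H".equals(encoding)) {
            return BY_ORDERING.get(ordering); // null for an unknown Ordering
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, "Type0", "Identity-H", "Japan1"));
    }
}
```

A null result is the signal that the caller has to fall back to some other strategy.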

I'm also concerned about my bytes -> unicode transformation requirement for non-type0 fonts (like the CJK stuff I see all over the place in DocumentFont)...

- K

----------------------- Original Message -----------------------

From: Paulo Soares <psoa...@glintt.com>

To: Post all your questions about iText here <itext-questions@lists.sourceforge.net>

Cc:

Date: Wed, 17 Dec 2008 19:54:02 +0000

Subject: Re: [iText-questions] extracting text from pdfs with japanese data

> -----Original Message-----
> From: Kevin Day [mailto:ke...@trumpetinc.com]
> Sent: Wednesday, December 17, 2008 7:40 PM
> To: IText Questions
> Subject: Re: [iText-questions] extracting text from pdfs with
> japanese data
>
> So far so good - but how do we figure out that UniJIS-UTF16-H
> is the correct ordering for Japan1? Or is that just a piece
> of hard coded knowledge? Is that information available in
> the font definition itself somehow?
>

That's in the cmap, in this case:

/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering (Japan1) def
/Supplement 5 def
end def

but the cmap names can be hardcoded; the same cmaps will always be used to convert from CID to Unicode.
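
As a rough sketch of how the Ordering could be read out of a cmap's text (for example to build the name table once, rather than by hand), a regular expression over the /CIDSystemInfo block is enough for the fragment shown above. The class name is hypothetical, and the caller is assumed to have already loaded the cmap file into a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not iText API: extracts the /Ordering value
// from the text of a cmap file.
final class CMapOrdering {
    // Matches e.g. "/Ordering (Japan1) def" and captures "Japan1".
    private static final Pattern ORDERING =
            Pattern.compile("/Ordering\\s*\\(([^)]*)\\)\\s*def");

    /** Returns the first Ordering found in the text, or null if there is none. */
    static String orderingOf(String cmapText) {
        Matcher m = ORDERING.matcher(cmapText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String fragment = "/CIDSystemInfo 3 dict dup begin\n"
                + "/Registry (Adobe) def\n"
                + "/Ordering (Japan1) def\n"
                + "/Supplement 5 def\n"
                + "end def\n";
        System.out.println(orderingOf(fragment));
    }
}
```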

> If we have an appropriate cmap resource file, then we should
> be able to construct a CMap object and use it when filling
> metrics, computing widths, etc...
>
> Is there any opposition to introducing CMap into DocumentFont
> as an alternative to metrics? My thought would be to have
> DocumentFont have two member variables:
>
> widths
> cMap
>
> instead of the metrics table.

The metrics have nothing to do with the cmaps and are always present in the font dictionary even if only with the /DW.
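
A minimal sketch of a width lookup that falls back to the default width, assuming widths are keyed by CID and that the /DW value has already been read from the font dictionary (the PDF specification gives 1000 as the value to use when /DW is absent). The class is hypothetical and not part of iText.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not iText API: explicit widths with a /DW fallback.
final class CidWidths {
    private final Map<Integer, Integer> widths = new HashMap<>();
    private final int defaultWidth;

    /** @param defaultWidth the font's /DW value (1000 if the entry is absent) */
    CidWidths(int defaultWidth) {
        this.defaultWidth = defaultWidth;
    }

    /** Records an explicit width for one CID. */
    void put(int cid, int width) {
        widths.put(cid, width);
    }

    /** Returns the explicit width for the CID, or the default width. */
    int getWidth(int cid) {
        Integer w = widths.get(cid);
        return w != null ? w : defaultWidth;
    }

    public static void main(String[] args) {
        CidWidths w = new CidWidths(1000);
        w.put(231, 500); // made-up sample entry
        System.out.println(w.getWidth(231) + " " + w.getWidth(232));
    }
}
```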

Paulo

>
> While we are at it, it seems to me that the way that widths
> are handled should probably be looked at. The widths map is
> very efficient. When it gets put into metrics, it expands
> considerably... Would it maybe be better to hold onto the
> widths map and compute the width instead of using the metrics
> lookup table? For that matter, perhaps the widths should be
> encapsulated into their own object with a simple getWidth()
> operation on it... That would allow for different width
> capturing strategies as needed.
>
> - K
>
> ----------------------- Original Message -----------------------
>
> From: Paulo Soares <psoa...@glintt.com>
> To: Post all your questions about iText here
> <itext-questions@lists.sourceforge.net>
> Cc:
> Date: Wed, 17 Dec 2008 19:20:18 +0000
> Subject: Re: [iText-questions] extracting text from pdfs with
> japanese data
>
> Some quick pointers:
>
> - Identity-H means that the codepoints in the content match
> the CID characters. To know what is what, we need to look at
> /CIDSystemInfo to find the /Ordering, in this case Japan1.
>
> - UniJIS-UTF16-H matches this ordering and we just need to
> use it as a reversed ToUnicode to get the corresponding
> Unicode. In general if the encoding is a cmap it will give us
> the translation to CID (Identity-H is the special case where
> the translation is 1-1) and it's with this CID that we must
> reverse it to go to Unicode.
>
> Paulo
>
> > -----Original Message-----
> > From: Kevin Day [mailto:ke...@trumpetinc.com]
> > Sent: Wednesday, December 17, 2008 6:35 PM
> > To: IText Questions
> > Subject: Re: [iText-questions] extracting text from pdfs with
> > japanese data
> >
> > OK - we know that content1.pdf is choking b/c of the embedded
> > images. To fix that, the PdfContentParser class would have
> > to be enhanced to properly handle those. This is outside of
> > my scope, but maybe someone else on the list is up for trying
> > to add it.
> >
> > There is a lot of content in tic_doug2.pdf, which makes it
> > hard to develop a decent use case - as things progress, it
> > may be appropriate to generate a sample PDF that contains
> > just one font and some pretty simple content (maybe two or
> > three lines of text with 4 or 5 words each). But given the
> > PDF we have now, the thing that is failing (at least for me)
> > is loading of one of the fonts (F1).
> >
> > Here is the info from the fonts dictionary:
> >
> > Subdictionary /F1 = (/DescendantFonts=[4 0 R],
> > /BaseFont=/GothicBBB-Medium-Identity-H, /Type=/Font,
> > /Encoding=/Identity-H, /Subtype=/Type0)
> >
> >
> > DocumentFont recognizes this as a Type0 font, and attempts to
> > parse it. However, it chokes b/c the ToUnicode map is not
> > specified for the font. As we know, this is expected,
> > because it should be possible to obtain a CMap from elsewhere
> > in the system, or intuit the CMap based on information in the
> > font (is that right??). I have a patch (below) that attempts
> > to work around this requirement. It's not pretty, but it
> > should tell us if we are on the right track or not.
> >
> >
> > So there are two pieces that have to be in place for this to work:
> >
> > 1. Ability to find the appropriate CMap resource
> > 2. Ability to load and use the CMap
> >
> > I'll take each in turn:
> >
> >
> > Finding the appropriate CMap resource
> >
> > I'm not at all experienced with font dictionaries. But my
> > initial thinking is that the F1 font listed above probably
> > does not require an external CMap - the encoding specified is
> > Identity-H, so shouldn't that be the CMap we use? And isn't
> > that CMap basically just a one-to-one mapping of characters
> > to bytes? Or are CMaps intended as a supplementary mapping
> > technology, so is it more appropriate to say there is no
> > CMap, so we shouldn't be trying to fill font metrics at all?
> >
> > I've put together a patch for DocumentFont that just assumes
> > an identity mapping when there is no ToUnicode entry provided
> > (if it would be better to handle this sort of thing in an SVN
> > branch, please let me know). This patch removes the null
> > pointer exception, and attempts to build an appropriate
> > metrics map. Note that this implementation is horrendously
> > inefficient (we wind up creating a metrics map that has
> > every potential character in it) - but it should at least
> > prove out the concept.
> >
> > If this is indeed what needs to be done in cases where an
> > Identity-H CMap is appropriate, then we can optimize... Be
> > sure to read the section that appears after the following
> > patch (outlining changes to DocumentFont to have it work with
> > CMaps directly) - this is the beginning of the change
> > required to handle the non-tounicode situation efficiently
> > (without storing a map essentially saying 1=1, 2=2, 3=3, ...
> > 0xffff = 0xffff).
> >
> > With this patch in place, my text content parser does not
> > fail - it produces text strings. I do not have the font in
> > question on my system, so I can't tell if the actual
> > extracted text is correct or not - perhaps Michael can apply
> > the patch and run the tic_doug2.pdf file through
> > PdfContentReaderTool and see what it gives him?
> >
> >
> >
> > ### Eclipse Workspace Patch 1.0
> > #P iText Trunk
> > Index: src/core/com/lowagie/text/pdf/DocumentFont.java
> > ===================================================================
> > --- src/core/com/lowagie/text/pdf/DocumentFont.java (revision 3613)
> > +++ src/core/com/lowagie/text/pdf/DocumentFont.java (working copy)
> > @@ -155,7 +155,8 @@
> >
> >      private void processType0(PdfDictionary font) {
> >          try {
> > -            byte[] touni = PdfReader.getStreamBytes((PRStream)PdfReader.getPdfObjectRelease(font.get(PdfName.TOUNICODE)));
> > +            PdfObject toUnicodeReference = font.get(PdfName.TOUNICODE);
> > +            byte[] touni = toUnicodeReference == null ? null : PdfReader.getStreamBytes((PRStream)PdfReader.getPdfObjectRelease(toUnicodeReference));
> >              PdfArray df = (PdfArray)PdfReader.getPdfObjectRelease(font.get(PdfName.DESCENDANTFONTS));
> >              PdfDictionary cidft = (PdfDictionary)PdfReader.getPdfObjectRelease((PdfObject)df.getArrayList().get(0));
> >              PdfNumber dwo = (PdfNumber)PdfReader.getPdfObjectRelease(cidft.get(PdfName.DW));
> > @@ -204,6 +205,20 @@
> >      }
> >
> >      private void fillMetrics (byte[] touni, IntHashtable widths, int dw) {
> > +        if (touni == null){ // just assume a one-to-one mapping
> > +            // this is hideously inefficient - much better to use a CMap object
> > +
> > +            for(int i = 0; i <= 0xffff; i++){
> > +                int unic = i;
> > +                int w = dw;
> > +                if (widths.containsKey(unic))
> > +                    w = widths.get(i);
> > +                metrics.put(new Integer(unic), new int[]{unic, w});
> > +            }
> > +
> > +            return;
> > +        }
> > +
> >          try {
> >              PdfContentParser ps = new PdfContentParser(new PRTokeniser(touni));
> >              PdfObject ob = null;
> >
> >
> >
> > Loading and using the CMap
> >
> > I think that the correct way to solve this is to make some
> > changes to DocumentFont.fillMetrics so that it works with
> > CMap objects instead of the unicode byte array. The CMap
> > object would be obtained (or at least attempted) during
> > construction of DocumentFont (regardless of the type of font!).
> >
> > This will actually drastically simplify fillMetrics (this
> > kind of parsing probably doesn't belong inside a font object
> > anyway), and it will make the CMap available via the
> > DocumentFont directly. If we do this, then the need for
> > CMapAwareDocumentFont is removed entirely, and we can remove
> > that class.
> >
> >
> > To keep things efficient, I could use some help with the
> > change to fillMetrics, specifically how to fill the 'metrics'
> > map instance variable given a CMap object - or does CMap
> > remove the need for the metrics map entirely? Maybe the map
> > should just be a 'widths' map? If it removes the need
> > entirely, then I think adjusting DocumentFont to just use
> > CMap makes the DocumentFont source much more readable.
> >
> >
> > So, I think I see a way forward here - but I'm going to need
> > some help from the iText maintainers. Here's what I see as
> > how to proceed:
> >
> > 1. add a loadCMap() method to DocumentFont (the results of
> > the 'finding the appropriate CMap resource' discussion above
> > will be used for this)
> > 2. call loadCMap() from DocumentFont(PRIndirectReference refFont)
> > 3. adjust both getWidth() methods so they use CMap instead
> > of metrics
> > 4. adjust both convertToBytes() methods so they use CMap
> > instead of metrics
> > 5. adjust charExists() so it uses CMap instead of metrics
> > 6. add getCMap() to DocumentFont
> > 7. move CMapAwareDocumentFont.encode() to DocumentFont (if
> > this is undesirable, we could continue to have this method in
> > a sub-class - it just seems more intuitive to have it in
> > DocumentFont)
> >
> > once that is in place, I can:
> >
> > 8. adjust PdfContentStreamProcessor so it uses
> > DocumentFont directly
> > 9. remove CMapAwareDocumentFont from the code base (depends
> > on step 7 above)
> >
> >
> >
> > What do you all think?
> >
> > - K
> >
> >
> > ----------------------- Original Message -----------------------
> >
> > From: "Hoppe, Michael" <michael.ho...@fiz-karlsruhe.de>
> > To: "Post all your questions about iText here"
> > <itext-questions@lists.sourceforge.net>
> > Cc:
> > Date: Wed, 17 Dec 2008 17:12:58 +0100
> > Subject: Re: [iText-questions] extracting text from pdfs with
> > japanese data
> >
> >
> > Hi all,
> >
> >
> >
> > Attached are the PDFs I had the problems with (I sent them
> > once before)
> >
> > content1.pdf gives : java.io.IOException: '>' not expected at
> > file pointer 39040
> >
> > tic_dogu2.pdf gives java.lang.NullPointerException because
> > font is not embedded in pdf
> >
> >
> >
> > Text from content1.pdf can be extracted with the Adobe
> > viewer bean (another open source library that we don't want
> > to use for our project for various reasons), so I don't think
> > there is anything wrong with the file itself.
> >
> >
> >
> > Greetings
> >
> >
> >
> > Michael
> >
> >
> >
> > Dr. Michael Hoppe
> > ePublishing & eScience
> > Development & Applied Research
> > Phone +49 7247 808-251
> > Fax +49 7247 808-133
> > michael.ho...@fiz-karlsruhe.de
> >
> >
> > FIZ Karlsruhe
> > Hermann-von-Helmholtz-Platz 1
> > 76344 Eggenstein-Leopoldshafen, Germany
> >
> > www.fiz-karlsruhe.de <http://www.fiz-karlsruhe.de>
> >
> > From: Kevin Day [mailto:ke...@trumpetinc.com]
> > Sent: Wednesday, 17 December 2008 15:31
> > To: IText Questions
> > Subject: Re: [iText-questions] extracting text from pdfs with
> > japanese data
> >
> >
> >
> > CMapAwareDocumentFont has this parsing via the CMap class -
> > this encapsulates the parsing behind an object, and makes it
> > a lot easier to deal with.
> >
> >
> >
> > I think that the biggest thing here is actually finding the
> > appropriate CMap data byte stream (either from embedded data
> > in the PDF, or from the file system) - right now, locating
> > the CMap information is a weak point in the content parser.
> >
> >
> >
> > If the cmap data is included in a jar on the classpath, then
> > the CMap could absolutely be read from the jar.
> >
> >
> >
> > Can the OP please send a PDF that demonstrates the issue?
> > I'll take a look at the font information and see how tough it
> > would be to add this type of lookup if TOUNICODE isn't available.
> >
> >
> >
> > - K
> >
> >
> >
> > ----------------------- Original Message -----------------------
> >
> >
> >
> > From: "Paulo Soares" <psoa...@consiste.pt>
> >
> > To: "Post all your questions about iText here"
> > <itext-questions@lists.sourceforge.net>
> >
> > Cc:
> >
> > Date: Tue, 16 Dec 2008 09:55:36 -0000
> >
> > Subject: Re: [iText-questions] extracting text from pdfs with
> > japanese data
> >
> >
> >
> > There's code in PdfEncodings to parse the cmaps and to convert
> > to/from Unicode with them.
> > The font contains the cmap name.
> >
> > Paulo
> >
> > ----- Original Message -----
> > From: "1T3XT info" <i...@1t3xt.info>
> > To: "Post all your questions about iText here"
> > <itext-questions@lists.sourceforge.net>
> > Sent: Tuesday, December 16, 2008 9:19 AM
> > Subject: Re: [iText-questions] extracting text from pdfs with
> > japanese data
> >
> >
> > Hoppe, Michael wrote:
> > > The CMap-files are included in the iTextAsianCmaps.jar. So
> > couldn't they
> > > be read from that jar in case there is no font information
> > in the pdf?
> >
> > I'm just thinking out loud here, I didn't dive into the problem yet,
> > but: do you think it's possible for iText to find which
> > CMap-file is to
> > be inspected based on the font information available in the PDF?
> >
> > As Kevin already said: this part of iText is pretty new. We're all
> > excited about it, but for the moment it's all highly experimental.
> > --
> > This answer is provided by 1T3XT BVBA
> > http://www.1t3xt.com/ - http://www.1t3xt.info

Disclaimer:
This message is destined exclusively to the intended receiver. It may contain confidential or legally protected information. The incorrect transmission of this message does not mean the loss of its confidentiality. If this message is received by mistake, please send it back to the sender and delete it from your system immediately. It is forbidden to any person who is not the intended receiver to use, distribute or copy any part of this message.

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you. Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/

_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
