Re: [iText-questions] extracting text from pdfs with japanese data

Leonard Rosenthol Wed, 17 Dec 2008 12:09:48 -0800

If there is no ToUnicode table for an Identity-H encoded font, then you can't get the text. cmaps aren't relevant in that case :).
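To make the point above concrete, here is a minimal sketch in plain Java. It does not use any iText classes: `IdentityHDecoder` and the `Map`-based ToUnicode table are hypothetical stand-ins. With Identity-H, every two bytes of a shown string form one code that is simply the glyph's CID, so a ToUnicode table is the only thing that can turn the codes back into text.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of iText.
class IdentityHDecoder {
    /**
     * Decodes a string shown with an Identity-H font. Every two bytes form one
     * big-endian code. Returns empty when there is no ToUnicode table, when the
     * byte count is odd, or when a code is missing from the table.
     */
    static Optional<String> decode(byte[] raw, Map<Integer, String> toUnicode) {
        if (toUnicode == null || raw.length % 2 != 0) {
            return Optional.empty();
        }
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < raw.length; i += 2) {
            int code = ((raw[i] & 0xff) << 8) | (raw[i + 1] & 0xff);
            String mapped = toUnicode.get(code);
            if (mapped == null) {
                return Optional.empty();
            }
            text.append(mapped);
        }
        return Optional.of(text.toString());
    }
}
```

Passing `null` for the table models the font discussed in this thread: the codes are still readable, but nothing says which characters they stand for.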


Leonard



On Dec 17, 2008, at 1:35 PM, Kevin Day wrote:

OK - we know that content1.pdf is choking b/c of the embedded images. To fix that, the PdfContentParser class would have to be enhanced to properly handle those. This is outside of my scope, but maybe someone else on the list is up for trying to add it.
There is a lot of content in tic_doug2.pdf, which makes it hard to develop a decent use case - as things progress, it may be appropriate to generate a sample PDF that contains just one font and some pretty simple content (maybe two or three lines of text with 4 or 5 words each). But given the PDF we have now, the thing that is failing (at least for me) is loading of one of the fonts (F1).
Here is the info from the fonts dictionary:
Subdictionary /F1 = (/DescendantFonts=[4 0 R], /BaseFont=/GothicBBB-Medium-Identity-H, /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0)
DocumentFont recognizes this as a Type0 font, and attempts to parse it. However, it chokes b/c the ToUnicode map is not specified for the font. As we know, this is expected, because it should be possible to obtain a CMap from elsewhere in the system, or intuit the CMap based on information in the font (is that right??). I have a patch (below) that attempts to work around this requirement. It's not pretty, but it should tell us if we are on the right track or not.
So there are two pieces that have to be in place for this to work:

1.  Ability to find the appropriate CMap resource
2.  Ability to load and use the CMap

I'll take each in turn:


Finding the appropriate CMap resource
I'm not at all experienced with font dictionaries. But my initial thinking is that the F1 font listed above probably does not require an external CMap - the encoding specified is Identity-H, so shouldn't that be the CMap we use? And isn't that CMap basically just a one-to-one mapping of characters to bytes? Or are CMaps intended as a supplementary mapping technology, so is it more appropriate to say there is no CMap, so we shouldn't be trying to fill font metrics at all?
I've put together a patch for DocumentFont that just assumes an identity mapping when there is no ToUnicode entry provided (if it would be better to handle this sort of thing in an SVN branch, please let me know). This patch removes the null pointer exception, and attempts to build an appropriate metrics map. Note that this implementation is horrendously inefficient (we wind up creating a metrics map that has every potential character in it) - but it should at least prove out the concept.
If this is indeed what needs to be done in cases where an Identity-H CMap is appropriate, then we can optimize... Be sure to read the section that appears after the following patch (outlining changes to DocumentFont to have it work with CMaps directly) - this is the beginning of the change required to handle the non-tounicode situation efficiently (without storing a map essentially saying 1=1, 2=2, 3=3, ... 0xffff = 0xffff).
With this patch in place, my text content parser does not fail - it produces text strings. I do not have the font in question on my system, so I can't tell if the actual extracted text is correct or not - perhaps Michael can apply the patch and run the tic_doug2.pdf file through PdfContentReaderTool and see what it gives him?
### Eclipse Workspace Patch 1.0
#P iText Trunk
Index: src/core/com/lowagie/text/pdf/DocumentFont.java
===================================================================
--- src/core/com/lowagie/text/pdf/DocumentFont.java (revision 3613)
+++ src/core/com/lowagie/text/pdf/DocumentFont.java (working copy)
@@ -155,7 +155,8 @@

     private void processType0(PdfDictionary font) {
         try {
-            byte[] touni = PdfReader.getStreamBytes((PRStream)PdfReader.getPdfObjectRelease(font.get(PdfName.TOUNICODE)));
+            PdfObject toUnicodeReference = font.get(PdfName.TOUNICODE);
+            byte[] touni = toUnicodeReference == null ? null : PdfReader.getStreamBytes((PRStream)PdfReader.getPdfObjectRelease(toUnicodeReference));
             PdfArray df = (PdfArray)PdfReader.getPdfObjectRelease(font.get(PdfName.DESCENDANTFONTS));
             PdfDictionary cidft = (PdfDictionary)PdfReader.getPdfObjectRelease((PdfObject)df.getArrayList().get(0));
             PdfNumber dwo = (PdfNumber)PdfReader.getPdfObjectRelease(cidft.get(PdfName.DW));
@@ -204,6 +205,20 @@
     }
     private void fillMetrics(byte[] touni, IntHashtable widths, int dw) {
+        if (touni == null){ // just assume a one-to-one mapping
+            // this is hideously inefficient - much better to use a CMap object
+
+            for(int i = 0; i <= 0xffff; i++){
+                int unic = i;
+                int w = dw;
+                if (widths.containsKey(unic))
+                    w = widths.get(i);
+                metrics.put(new Integer(unic), new int[]{unic, w});
+            }
+
+            return;
+        }
+
         try {
             PdfContentParser ps = new PdfContentParser(new PRTokeniser(touni));
             PdfObject ob = null;
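
The patch fills the metrics map with one entry for every code from 0 to 0xffff. As a rough illustration of how that could be avoided, the sketch below computes the same {unicode, width} pair on demand. It is plain Java and does not touch iText: `IdentityMetrics` is a hypothetical class, and a `java.util.Map` stands in for the `IntHashtable` of widths.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the identity branch of fillMetrics; not iText code.
class IdentityMetrics {
    private final Map<Integer, Integer> widths; // code -> explicitly listed width
    private final int defaultWidth;             // plays the role of dw in the patch

    IdentityMetrics(Map<Integer, Integer> widths, int defaultWidth) {
        this.widths = new HashMap<>(widths);
        this.defaultWidth = defaultWidth;
    }

    /** Returns the pair the patch would have stored for this code: {unicode, width}. */
    int[] get(int code) {
        if (code < 0 || code > 0xffff) {
            throw new IllegalArgumentException("code out of range: " + code);
        }
        // Identity mapping: the "unicode" value is just the code itself.
        return new int[] {code, widths.getOrDefault(code, defaultWidth)};
    }
}
```

Only the explicitly listed widths are stored, so memory use follows the size of the widths table instead of the full 16-bit code range.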


Loading and using the CMap
I think that the correct way to solve this is to make some changes to DocumentFont.fillMetrics so that it works with CMap objects instead of the unicode byte array. The CMap object would be obtained (or at least attempted) during construction of DocumentFont (regardless of the type of font!).
This will actually drastically simplify fillMetrics (this kind of parsing probably doesn't belong inside a font object anyway), and it will make the CMap available via the DocumentFont directly. If we do this, then the need for CMapAwareDocumentFont is removed entirely, and we can remove that class.
To keep things efficient, I could use some help with the change to fillMetrics, specifically how to fill the 'metrics' map instance variable given a CMap object - or does CMap remove the need for the metrics map entirely? Maybe the map should just be a 'widths' map? If it removes the need entirely, then I think adjusting DocumentFont to just use CMap makes the DocumentFont source much more readable.
So, I think I see a way forward here - but I'm going to need some help from the iText maintainers. Here's what I see as how to proceed:
1.  add a loadCMap() method to DocumentFont (the results of the 'finding the appropriate CMap resource' discussion above will be used for this)
2.  call loadCMap() from DocumentFont(PRIndirectReference refFont)
3.  adjust both getWidth() methods so they use CMap instead of metrics
4.  adjust both convertToBytes() methods so they use CMap instead of metrics
5.  adjust charExists() so it uses CMap instead of metrics
6.  add getCMap() to DocumentFont
7.  move CMapAwareDocumentFont.encode() to DocumentFont (if this is undesirable, we could continue to have this method in a sub-class - it just seems more intuitive to have it in DocumentFont)
once that is in place, I can:

8.  adjust PdfContentStreamProcessor so it uses DocumentFont directly
9.  remove CMapAwareDocumentFont from the code base (depends on step 7 above)
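
One possible shape for the font object after the steps above, as a rough sketch only. Everything here is hypothetical: `SketchDocumentFont` is not the real DocumentFont, an `IntFunction<String>` stands in for the CMap class, a plain `Map` holds the widths, and `decode()` only approximates the role of the `encode()` method from step 7. It assumes two-byte codes, as with the Identity-H font in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch of a font that keeps a CMap plus a widths map
// instead of one combined metrics map. Not the real DocumentFont.
class SketchDocumentFont {
    private final IntFunction<String> cmap;     // code -> Unicode text, null if unmapped
    private final Map<Integer, Integer> widths; // code -> width
    private final int defaultWidth;

    SketchDocumentFont(IntFunction<String> cmap, Map<Integer, Integer> widths, int defaultWidth) {
        this.cmap = cmap;
        this.widths = new HashMap<>(widths);
        this.defaultWidth = defaultWidth;
    }

    IntFunction<String> getCMap() {
        return cmap;
    }

    boolean charExists(int code) {
        return cmap.apply(code) != null;
    }

    int getWidth(int code) {
        return widths.getOrDefault(code, defaultWidth);
    }

    /** Turns two-byte codes into text; unmapped codes become U+FFFD. */
    String decode(byte[] raw) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xff) << 8) | (raw[i + 1] & 0xff);
            String mapped = cmap.apply(code);
            text.append(mapped != null ? mapped : "\uFFFD");
        }
        return text.toString();
    }
}
```

In this shape the widths lookup and the code-to-text lookup are independent, which is one way to read the "maybe the map should just be a 'widths' map" question above.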
What do you all think?

- K


----------------------- Original Message -----------------------

From: "Hoppe, Michael" <michael.ho...@fiz-karlsruhe.de>
To: "Post all your questions about iText here" <itext-questions@lists.sourceforge.net>
Cc:
Date: Wed, 17 Dec 2008 17:12:58 +0100
Subject: Re: [iText-questions] extracting text from pdfs with japanese data
Hi all,
Attached are the PDFs I had the problems with (I sent them once before):
content1.pdf gives: java.io.IOException: '>' not expected at file pointer 39040
tic_dogu2.pdf gives: java.lang.NullPointerException, because the font is not embedded in the PDF
Text from content1.pdf can be extracted with the Adobe viewer bean (another open source library that we don’t want to use for our project for various reasons), so I don’t think there is something wrong with the file itself.
Greetings

Michael

Dr. Michael Hoppe
ePublishing & eScience
Development & Applied Research
Phone +49 7247 808-251
Fax +49 7247 808-133
michael.ho...@fiz-karlsruhe.de


FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany

www.fiz-karlsruhe.de
From: Kevin Day [mailto:ke...@trumpetinc.com]
Sent: Wednesday, December 17, 2008 15:31
To: IText Questions
Subject: Re: [iText-questions] extracting text from pdfs with japanese data
CMapAwareDocumentFont has this parsing via the CMap class - this encapsulates the parsing behind an object, and makes it a lot easier to deal with.
I think that the biggest thing here is actually finding the appropriate CMap data byte stream (either from embedded data in the PDF, or from the file system) - right now, locating the CMap information is a weak point in the content parser.
If the cmap data is included in a jar on the classpath, then the CMap could absolutely be read from the jar.
Can the OP please send a PDF that demonstrates the issue? I'll take a look at the font information and see how tough it would be to add this type of lookup if TOUNICODE isn't available.
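Reading CMap data from a jar on the classpath could look roughly like the sketch below. It uses only the JDK's `Class.getResourceAsStream`. The class name `ClasspathCMapLoader`, the resource folder and the `.cmap` suffix are assumptions for illustration and should be checked against the jar that is actually deployed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical loader; the folder and suffix below are assumed, not verified.
class ClasspathCMapLoader {
    static final String CMAP_DIR = "/com/lowagie/text/pdf/fonts/cmaps/";
    static final String CMAP_SUFFIX = ".cmap";

    /** Returns the raw CMap bytes, or empty if no such resource is on the classpath. */
    static Optional<byte[]> load(String cmapName) {
        String path = CMAP_DIR + cmapName + CMAP_SUFFIX;
        // try-with-resources tolerates a null resource, so the null check is safe here.
        try (InputStream in = ClasspathCMapLoader.class.getResourceAsStream(path)) {
            if (in == null) {
                return Optional.empty();
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return Optional.of(out.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning an empty `Optional` for a missing resource lets the caller fall back to another source, such as data embedded in the PDF or a file on disk.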
- K

----------------------- Original Message -----------------------

From: "Paulo Soares" <psoa...@consiste.pt>
To: "Post all your questions about iText here" <itext-questions@lists.sourceforge.net>
Cc:
Date: Tue, 16 Dec 2008 09:55:36 -0000
Subject: Re: [iText-questions] extracting text from pdfs with japanese data
There's code in PdfEncodings to parse and convert to/from Unicode the cmaps.
The font contains the cmap name.
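
As a small illustration of reading the cmap name out of the font, the sketch below looks at a Type0 font dictionary modelled as a plain `Map` of strings. `CMapNameResolver` is hypothetical and not iText code. In a Type0 font the /Encoding entry is either the name of a predefined CMap or an embedded stream. This sketch only handles the name case and treats Identity-H and Identity-V as "no external CMap", since those simply map two-byte codes to CIDs.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper working on a simplified, string-only font dictionary.
class CMapNameResolver {
    /** Returns the predefined CMap name to look up, or empty if there is none. */
    static Optional<String> externalCMapName(Map<String, String> fontDict) {
        if (!"/Type0".equals(fontDict.get("/Subtype"))) {
            return Optional.empty();
        }
        String encoding = fontDict.get("/Encoding");
        if (encoding == null || !encoding.startsWith("/")) {
            return Optional.empty();
        }
        String name = encoding.substring(1);
        if (name.equals("Identity-H") || name.equals("Identity-V")) {
            return Optional.empty(); // codes are CIDs; no named CMap file applies
        }
        return Optional.of(name);
    }
}
```

For the /F1 dictionary quoted earlier in the thread (/Encoding=/Identity-H) this returns empty, while a font with /Encoding=/90ms-RKSJ-H would yield the name 90ms-RKSJ-H.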

Paulo

----- Original Message -----
From: "1T3XT info" <i...@1t3xt.info>
To: "Post all your questions about iText here"
<itext-questions@lists.sourceforge.net>
Sent: Tuesday, December 16, 2008 9:19 AM
Subject: Re: [iText-questions] extracting text from pdfs with japanese data
Hoppe, Michael wrote:
> The CMap-files are included in the iTextAsianCmaps.jar. So couldn’t they
> be read from that jar in case there is no font information in the pdf?
I'm just thinking out loud here, I didn't dive into the problem yet,
but: do you think it's possible for iText to find which CMap-file is to
be inspected based on the font information available in the PDF?

As Kevin already said: this part of iText is pretty new. We're all
excited about it, but for the moment it's all highly experimental.
--
This answer is provided by 1T3XT BVBA
http://www.1t3xt.com/ - http://www.1t3xt.info


------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you. Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php


-------------------------------------------------------
Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische Information mbH.
Registered office: Eggenstein-Leopoldshafen, Amtsgericht Mannheim HRB 101892.
Managing director: Sabine Brünger-Weilandt.
Chairman of the supervisory board: MinR Hermann Riehl.


