Re: [iText-questions] extracting text from pdfs with japanese data

Paulo Soares Thu, 18 Dec 2008 03:25:54 -0800

 

> -----Original Message-----
> From: Kevin Day [mailto:[email protected]] 
> Sent: Wednesday, December 17, 2008 11:09 PM
> To: IText Questions
> Subject: Re: [iText-questions] extracting text from pdfs with 
> japanese data
> 
> Ahhh mea-culpa...  I do think I remember reading something about this.
>  
> The PDF specification (section 9.10.2) says that there are 
> ways to extract unicode without a ToUnicode map in a handful 
> of specialized cases - but Identity-H definitely isn't one of 
> them - unless we can load a CMap from an external resource?
>  
> So, I've learned a lot - and the answer to the OP is that the 
> tic_dogu2.pdf file doesn't have computer readable text in it at all.
>


Yes, it has, otherwise it wouldn't be possible to show the text without an 
embedded font. In this case Identity-H means that the CID characters are 
directly put in the content and that a cmap is not needed to convert from 
content to CID. For each ordering there's a definition of what CID corresponds 
to what character and this doesn't change. We can think of the CID as a kind of 
Adobe specific Unicode for each ordering.
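
To make the point about Identity-H concrete, here is a minimal sketch, assuming two bytes per code as Identity-H defines: each big-endian pair in a string operand is already the CID. The class and method names are hypothetical and are not part of iText. Turning the CIDs into Unicode would still need the table for the font's ordering, which is not shown here.

```java
/** Sketch only: with Identity-H, each 2-byte big-endian code is the CID itself. */
final class IdentityHSketch {

    /** Splits the bytes of a content-stream string into CIDs (hypothetical helper). */
    static int[] toCids(byte[] stringBytes) {
        if (stringBytes.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings use two bytes per code");
        }
        int[] cids = new int[stringBytes.length / 2];
        for (int i = 0; i < cids.length; i++) {
            int hi = stringBytes[2 * i] & 0xff;      // high byte, made unsigned
            int lo = stringBytes[2 * i + 1] & 0xff;  // low byte, made unsigned
            cids[i] = (hi << 8) | lo;                // no cmap lookup is involved
        }
        return cids;
    }
}
```

For example, the four bytes 03 A1 00 01 yield the two CIDs 0x03A1 and 0x0001.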

Paulo
  
>  
> One thing that has always puzzled me is this disparity 
> between unicode and glyphs.  I'm probably limited in my 
> understanding by the fact that my entire exposure to fonts 
> has been with US character sets, but it seems to me that a 
> given glyph would have a unicode equivalent in any font, 
> wouldn't it?  If there weren't some type of mapping, it 
> wouldn't be possible to render the font for a given unicode 
> sequence, making the font pretty much useless.  Can anyone 
> shed some light on this?  I'm sorry if this question is 
> naive - I'm just trying to get my head around the difference.
>  
> - K
> 
>  
> ----------------------- Original Message -----------------------
>   
> From: Leonard Rosenthol <[email protected]> 
> <mailto:[email protected]> 
> To: Post all your questions about iText here 
> <[email protected]> 
> <mailto:[email protected]> 
> Cc: 
> Date: Wed, 17 Dec 2008 15:09:16 -0500
> Subject: Re: [iText-questions] extracting text from pdfs with 
> japanese data
>   
> If there is no ToUnicode table for an Identity-H encoded 
> font, then you can't get the text.  cmaps aren't relevant in 
> that case :). 
> 
> Leonard
> 
> 
> On Dec 17, 2008, at 1:35 PM, Kevin Day wrote:
> 
> 
>       
>       
>       OK - we know that content1.pdf is choking b/c of the 
> embedded images.  To fix that, the PdfContentParser class 
> would have to be enhanced to properly handle those.  This is 
> outside of my scope, but maybe someone else on the list is up 
> for trying to add it.
>        
>       There is a lot of content in tic_dogu2.pdf, which makes 
> it hard to develop a decent use case - as things progress, it 
> may be appropriate to generate a sample PDF that contains 
> just one font and some pretty simple content (maybe two or 
> three lines of text with 4 or 5 words each).  But given the 
> PDF we have now, the thing that is failing (at least for me) 
> is loading of one of the fonts (F1).
>        
>       Here is the info from the fonts dictionary:
>        
>       
>       Subdictionary /F1 = (/DescendantFonts=[4 0 R], 
> /BaseFont=/GothicBBB-Medium-Identity-H, /Type=/Font, 
> /Encoding=/Identity-H, /Subtype=/Type0)
> 
>       DocumentFont recognizes this as a Type0 font, and 
> attempts to parse it.  However, it chokes b/c the ToUnicode 
> map is not specified for the font.  As we know, this is 
> expected, because it should be possible to obtain a CMap from 
> elsewhere in the system, or intuit the CMap based on 
> information in the font (is that right??).  I have a patch 
> (below) that attempts to work around this requirement.  It's 
> not pretty, but it should tell us if we are on the right track or not.
>        
>        
>       So there are two pieces that have to be in place for 
> this to work:
>        
>       1.  Ability to find the appropriate CMap resource
>       2.  Ability to load and use the CMap
>        
>       I'll take each in turn:
>        
>        
>       Finding the appropriate CMap resource
>        
>       I'm not at all experienced with font dictionaries.  But 
> my initial thinking is that the F1 font listed above probably 
> does not require an external CMap - the encoding specified is 
> Identity-H, so shouldn't that be the CMap we use?  And isn't 
> that CMap basically just a one-to-one mapping of characters 
> to bytes?  Or are CMaps intended as a supplementary mapping 
> technology, so is it more appropriate to say there is no 
> CMap, so we shouldn't be trying to fill font metrics at all?
>        
>       I've put together a patch for DocumentFont that just 
> assumes an identity mapping when there is no ToUnicode entry 
> provided (if it would be better to handle this sort of thing 
> in an SVN branch, please let me know).  This patch removes the 
> null pointer exception, and attempts to build an appropriate 
> metrics map.  Note that this implementation is horrendously 
> inefficient (we wind up creating a metrics map that has 
> every potential character in it) - but it should at least 
> prove out the concept.
>        
>       If this is indeed what needs to be done in cases where 
> an Identity-H CMap is appropriate, then we can optimize...  
> Be sure to read the section that appears after the following 
> patch (outlining changes to DocumentFont to have it work with 
> CMaps directly) - this is the beginning of the change 
> required to handle the non-tounicode situation efficiently 
> (without storing a map essentially saying 1=1, 2=2, 3=3, ... 
> 0xffff = 0xffff).
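
One way to avoid storing the 1=1, 2=2, ... map is to compute each entry on demand. The following is only a sketch under that assumption: it uses a plain java.util.Map in place of iText's IntHashtable, and the class and method names are hypothetical rather than anything in DocumentFont.

```java
import java.util.Map;

/** Sketch only: identity metrics computed per lookup instead of stored for 0..0xffff. */
final class IdentityMetricsSketch {
    private final Map<Integer, Integer> widths; // code -> width, for codes that have one
    private final int defaultWidth;             // the font's default width (dw in the patch)

    IdentityMetricsSketch(Map<Integer, Integer> widths, int defaultWidth) {
        this.widths = widths;
        this.defaultWidth = defaultWidth;
    }

    /** Returns {code, width}, the same pair the patch stores in the metrics map. */
    int[] metricsFor(int code) {
        if (code < 0 || code > 0xffff) {
            throw new IllegalArgumentException("code out of range: " + code);
        }
        Integer w = widths.get(code);
        return new int[] { code, w != null ? w : defaultWidth };
    }
}
```

Memory use then depends only on how many explicit widths the font declares, not on the size of the code space.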
>        
>       With this patch in place, my text content parser does 
> not fail - it produces text strings.  I do not have the font 
> in question on my system, so I can't tell if the actual 
> extracted text is correct or not - perhaps Michael can apply 
> the patch and run the tic_dogu2.pdf file through 
> PdfContentReaderTool and see what it gives him?
>        
>        
>        
>       ### Eclipse Workspace Patch 1.0
>       #P iText Trunk
>       Index: src/core/com/lowagie/text/pdf/DocumentFont.java
>       
>       ===================================================================
>       --- src/core/com/lowagie/text/pdf/DocumentFont.java (revision 3613)
>       +++ src/core/com/lowagie/text/pdf/DocumentFont.java (working copy)
>       @@ -155,7 +155,8 @@
>            
>            private void processType0(PdfDictionary font) {
>                try {
>       -            byte[] touni = PdfReader.getStreamBytes((PRStream)PdfReader.getPdfObjectRelease(font.get(PdfName.TOUNICODE)));
>       +            PdfObject toUnicodeReference = font.get(PdfName.TOUNICODE);
>       +            byte[] touni = toUnicodeReference == null ? null : PdfReader.getStreamBytes((PRStream)PdfReader.getPdfObjectRelease(toUnicodeReference));
>                    PdfArray df = (PdfArray)PdfReader.getPdfObjectRelease(font.get(PdfName.DESCENDANTFONTS));
>                    PdfDictionary cidft = (PdfDictionary)PdfReader.getPdfObjectRelease((PdfObject)df.getArrayList().get(0));
>                    PdfNumber dwo = (PdfNumber)PdfReader.getPdfObjectRelease(cidft.get(PdfName.DW));
>       @@ -204,6 +205,20 @@
>            }
>            
>            private void fillMetrics(byte[] touni, IntHashtable widths, int dw) {
>       +        if (touni == null){ // just assume a one-to-one mapping
>       +            // this is hideously inefficient - much better to use a CMap object
>       +            
>       +            for(int i = 0; i <= 0xffff; i++){
>       +                int unic = i;
>       +                int w = dw;
>       +                if (widths.containsKey(unic))
>       +                    w = widths.get(i);
>       +                metrics.put(new Integer(unic), new int[]{unic, w});
>       +            }
>       +            
>       +            return;
>       +        }
>       +        
>                try {
>                    PdfContentParser ps = new PdfContentParser(new PRTokeniser(touni));
>                    PdfObject ob = null;
>       
>        
>        
>       Loading and using the CMap
>        
>       I think that the correct way to solve this is to make 
> some changes to DocumentFont.fillMetrics so that it works 
> with CMap objects instead of the unicode byte array.  The 
> CMap object would be obtained (or at least attempted) during 
> construction of DocumentFont (regardless of the type of font!).
>        
>       This will actually drastically simplify fillMetrics 
> (this kind of parsing probably doesn't belong inside a font 
> object anyway), and it will make the CMap available via the 
> DocumentFont directly.  If we do this, then the need for 
> CMapAwareDocumentFont is removed entirely, and we can remove 
> that class.
>        
>        
>       To keep things efficient, I could use some help with 
> the change to fillMetrics, specifically how to fill the 
> 'metrics' map instance variable given a CMap object - or does 
> CMap remove the need for the metrics map entirely?  Maybe the 
> map should just be a 'widths' map?  If it removes the need 
> entirely, then I think adjusting DocumentFont to just use 
> CMap makes the DocumentFont source much more readable.
>        
>        
>       So, I think I see a way forward here - but I'm going to 
> need some help from the iText maintainers.  Here's what I see 
> as how to proceed:
>        
>       1.  add a loadCMap() method to DocumentFont  (the 
> results of the 'finding the appropriate CMap resource' 
> discussion above will be used for this)
>       2.  call loadCMap() from 
> DocumentFont(PRIndirectReference refFont)
>       3.  adjust both getWidth() methods so they use CMap 
> instead of metrics
>       4.  adjust both convertToBytes() methods so they use 
> CMap instead of metrics
>       5.  adjust charExists() so it uses CMap instead of metrics
>       6.  add getCMap() to DocumentFont
>       7.  move CMapAwareDocumentFont.encode() to DocumentFont 
>  (if this is undesirable, we could continue to have this 
> method in a sub-class - it just seems more intuitive to have 
> it in DocumentFont)
>        
>       once that is in place, I can:
>        
>       8.  adjust PdfContentStreamProcessor so it uses 
> DocumentFont directly
>       9.  remove CMapAwareDocumentFont from the code base 
> (depends on step 7 above)
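
The shape proposed in steps 1 to 7 could look roughly like the sketch below. Everything in it is hypothetical: CodeMap stands in for the CMap class, FontSketch stands in for DocumentFont, and the method bodies are illustrations, not iText code. It assumes two-byte codes and keeps widths in a separate table, since a CMap carries no width information.

```java
import java.util.Map;

/** Stand-in for the CMap class discussed above (hypothetical, not the iText type). */
interface CodeMap {
    /** Returns the Unicode text for a character code, or null when unmapped. */
    String lookup(int code);
}

/** Sketch of a font object that owns its code map and a plain widths table. */
final class FontSketch {
    private final CodeMap cmap;
    private final Map<Integer, Integer> widths;
    private final int defaultWidth;

    FontSketch(CodeMap toUnicode, Map<Integer, Integer> widths, int defaultWidth) {
        this.cmap = loadCMap(toUnicode); // step 2: resolved once, during construction
        this.widths = widths;
        this.defaultWidth = defaultWidth;
    }

    /** Step 1: with no ToUnicode data, use a map that reports every code as unmapped. */
    private static CodeMap loadCMap(CodeMap toUnicode) {
        return toUnicode != null ? toUnicode : code -> null;
    }

    CodeMap getCMap() { return cmap; } // step 6

    /** Step 3: the width comes from the widths table, else the default width. */
    int getWidth(int code) {
        Integer w = widths.get(code);
        return w != null ? w : defaultWidth;
    }

    /** Step 5: a code exists if the code map knows it. */
    boolean charExists(int code) { return cmap.lookup(code) != null; }

    /** Step 7: converts two-byte codes to text, skipping unmapped codes. */
    String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            int code = ((bytes[i] & 0xff) << 8) | (bytes[i + 1] & 0xff);
            String text = cmap.lookup(code);
            if (text != null) sb.append(text);
        }
        return sb.toString();
    }
}
```

With this shape the null check for a missing ToUnicode entry happens in one place, and callers never see a null map.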
>        
>        
>        
>       What do you all think?
>        
>       - K
>        
>        
>       ----------------------- Original Message -----------------------
>         
>       From: "Hoppe, Michael" <[email protected]> 
> <mailto:[email protected]> 
>       To: "Post all your questions about iText here" 
> <[email protected]> 
> <mailto:[email protected]> 
>       Cc: 
>       Date: Wed, 17 Dec 2008 17:12:58 +0100
>       Subject: Re: [iText-questions] extracting text from 
> pdfs with japanese data
>         
>       Hi all,
>        
>       Attached are the PDFs I had problems with (I sent 
> them once before):
>       content1.pdf gives: java.io.IOException: '>' not 
> expected at file pointer 39040
>       tic_dogu2.pdf gives: java.lang.NullPointerException, 
> because the font is not embedded in the PDF
>        
>       text from content1.pdf can get extracted with the adobe 
> viewer bean (another open source library that we don't want 
> to use for our project for various reasons) so I don't think 
> there is something wrong with the file itself.
>        
>       Greetings
>        
>       Michael
>        
>       Dr. Michael Hoppe
>       ePublishing & eScience
>       Development & Applied Research
>       Phone +49 7247 808-251
>       Fax +49 7247 808-133
>       [email protected]
>       
>       
>       FIZ Karlsruhe
>       Hermann-von-Helmholtz-Platz 1
>       76344 Eggenstein-Leopoldshafen, Germany
>       
>       www.fiz-karlsruhe.de <http://www.fiz-karlsruhe.de/> 
>       From: Kevin Day [mailto:[email protected]] 
>       Sent: Wednesday, December 17, 2008 15:31
>       To: IText Questions
>       Subject: Re: [iText-questions] extracting text from 
> pdfs with japanese data
>        
>       CMapAwareDocumentFont has this parsing via the CMap 
> class - this encapsulates the parsing behind an object, and 
> makes it a lot easier to deal with.
>        
>       I think that the biggest thing here is actually finding 
> the appropriate CMap data byte stream (either from embedded 
> data in the PDF, or from the file system) - right now, 
> locating the CMap information is a weak point in the content parser.
>        
>       If the cmap data is included in a jar on the classpath, 
> then the CMap could absolutely be read from the jar.
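
Reading the bytes of a CMap file from a jar on the classpath needs nothing beyond the standard class loader. A minimal sketch follows, assuming the caller already knows the resource path; the class name is hypothetical, and the actual path layout inside iTextAsianCmaps.jar is not shown in this thread, so it is left as a parameter.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Sketch only: load the raw bytes of a classpath resource such as a CMap file. */
final class CMapResourceSketch {

    /** Returns the resource bytes, or null when nothing is found at that path. */
    static byte[] readResource(String resourcePath) {
        ClassLoader loader = CMapResourceSketch.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourcePath)) {
            if (in == null) {
                return null; // not on the classpath, so the caller must fall back
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned bytes would then be handed to whatever parses the CMap. Deciding which resource name to ask for, based on the font dictionary, remains the open question.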
>        
>       Can the OP please send a PDF that demonstrates the 
> issue?  I'll take a look at the font information and see how 
> tough it would be to add this type of lookup if TOUNICODE 
> isn't available.
>        
>       - K
>        
>       ----------------------- Original Message -----------------------
>         
>       From: "Paulo Soares" <[email protected]> 
> <mailto:[email protected]> 
>       To: "Post all your questions about iText here" 
> <[email protected]> 
> <mailto:[email protected]> 
>       Cc: 
>       Date: Tue, 16 Dec 2008 09:55:36 -0000
>       Subject: Re: [iText-questions] extracting text from 
> pdfs with japanese data
>         
>       There's code in PdfEncodings to parse the cmaps and 
> convert to/from Unicode. 
>       The font contains the cmap name.
>       
>       Paulo
>       
>       ----- Original Message ----- 
>       From: "1T3XT info" <[email protected]> <mailto:[email protected]> 
>       To: "Post all your questions about iText here" 
>       <[email protected]> 
> <mailto:[email protected]> 
>       Sent: Tuesday, December 16, 2008 9:19 AM
>       Subject: Re: [iText-questions] extracting text from 
> pdfs with japanese data
>       
>       
>       Hoppe, Michael wrote:
>       > The CMap-files are included in the 
> iTextAsianCmaps.jar. So couldn't they
>       > be read from that jar in case there is no font 
> information in the pdf?
>       
>       I'm just thinking out loud here, I didn't dive into the 
> problem yet,
>       but: do you think it's possible for iText to find which 
> CMap-file is to
>       be inspected based on the font information available 
> in the PDF?
>       
>       As Kevin already said: this part of iText is pretty 
> new. We're all
>       excited about it, but for the moment it's all highly 
> experimental.
>       -- 
>       This answer is provided by 1T3XT BVBA
>       http://www.1t3xt.com/ - http://www.1t3xt.info




_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
