Re: text extraction

Andreas Lehmkühler Fri, 10 Sep 2010 01:25:28 -0700

Hi,

Sent: Wed, 08 Sep 2010 From: reinhard schwab <reinhard.sch...@aon.at>


> Andreas Lehmkühler wrote:
> > Sent: Sat, 04 Sep 2010 From: reinhard schwab <reinhard.sch...@aon.at>
> >> extracted text with
> >>
> >> PDDocument doc = PDDocument.load(new URL(
> >>     "http://people.ischool.berkeley.edu/~hearst/irbook/print/chap10.pdf"));
> >> PDFTextStripper stripper = new PDFTextStripper();
> >> stripper.writeText(doc, new OutputStreamWriter(System.out));
> >>
> >> looks like this
> >>
> >> ¡ ¢¤£¦¥¨§ª© ®©°¯±¢²§ª³ ´¶µ¸·¹¢º© » ¥¼µ½§?·?¥??¼´²Â
> >>  "!$#&%ª')(+* ,-%ª.?/0%?132"%?45.?6
> >> ,-.7'84:97!;.7'< "!>=?.ª!>'?*�...@b.c4®*
> >> ACM Press
> >> New York
> >> Addison-Wesley
> >> D)EGFIH J>KMLON8P$QRH ESPUT?V?WYXZE>TR[\PUQ]L_^`E>ababE>cedgfUahX;ijija
> >>     
> > The mentioned PDF uses Type 3 fonts for most of the text. This font type
> > consists of glyphs for every single letter and doesn't have any encoding. In
> > most cases this kind of text content can't be extracted; even Acrobat
> > Reader won't do it (try it by selecting some of the text and copying and
> > pasting it into a text editor: the text will be scrambled).
> >
> > BR
> > Andreas Lehmkühler
> >
> >   
> hi,
> so what is pdfbox doing now with such fonts?
> when i try to extract some text from a pdf file, i expect to get
> readable text.
I'm afraid you have to change your expectation.

> i interface pdfbox by using the tika api.
> the code is:
> 
>         if  ("application/pdf".equals(contentType)) {
>             parser = new PDFParser();
>         }
>         InputStream responseBody = new ByteArrayInputStream(content);
> 
>         ContentHandler textHandler = new BodyContentHandler(10000000);
>         ParseContext pc = new ParseContext();
>         try {
>             parser.parse(responseBody, textHandler, metadata, pc);
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
> 
> should the PDFParser in Tika catch this or should pdfbox catch this or
> should my application interfacing Tika catch this?
> i now have to check the text returned by Tika for such nonreadable text
> because i index it with lucene etc...
> is it obvious for pdfbox that it can't extract the text in this situation?
If we treat the use of a Type 3 font as an indicator of unextractable text, it
would be possible to trigger some sort of feedback that (part of) the text
is unreadable.
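
Until such feedback exists, an application could run a rough check on the
extracted text itself before indexing it. The following is only a sketch
under stated assumptions: ReadabilityCheck, alphanumericRatio and
looksReadable are hypothetical names and not part of PDFBox or Tika, the
check assumes mostly English (ASCII) content, and the 0.7 threshold is an
arbitrary starting point that needs tuning per corpus.

```java
// Hypothetical helper, not part of PDFBox or Tika.
class ReadabilityCheck {

    // Fraction of non-whitespace characters that are ASCII letters or digits.
    // Returns 0.0 for empty or whitespace-only input.
    static double alphanumericRatio(String text) {
        int counted = 0;
        int alnum = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            counted++;
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean asciiDigit = c >= '0' && c <= '9';
            if (asciiLetter || asciiDigit) {
                alnum++;
            }
        }
        return counted == 0 ? 0.0 : (double) alnum / counted;
    }

    // True if the text passes the (assumed, tunable) threshold.
    static boolean looksReadable(String text, double threshold) {
        return alphanumericRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        double threshold = 0.7; // assumption: arbitrary starting point
        String[] samples = { "ACM Press New York", "!$#&%')(+* ,-%.?/0%?132" };
        for (String s : samples) {
            System.out.println(looksReadable(s, threshold) + "  " + s);
        }
    }
}
```

This catches punctuation-heavy output, but scrambled text that happens to
consist mostly of letters can still slip through, so it is a coarse filter
rather than a reliable detector.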

> is there no chance to translate or map these glyphs back into characters?
AFAIK Type 3 fonts never have an encoding to map glyphs to characters. I'm
sorry, but there is no chance to extract it with pdfbox or any other tool like
Acrobat Reader. The only way I know of is to use OCR software.

BR
Andreas Lehmkühler
