Re: text extraction

Andreas Lehmkühler Mon, 06 Sep 2010 00:34:30 -0700

Hi,


Sent: Sat, 04 Sep 2010 From: reinhard schwab<reinhard.sch...@aon.at>

> extracted text with
> 
> PDDocument doc = PDDocument.load(new URL(
>         "http://people.ischool.berkeley.edu/~hearst/irbook/print/chap10.pdf"));
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.writeText(doc, new OutputStreamWriter(System.out));
> 
> looks like this
> 
> ¡ ¢¤£¦¥¨§ª© ®©°¯±¢²§ª³ ´¶µ¸·¹¢º© » ¥¼µ½§?·?¥??¼´²Â
>  "!$#&%ª')(+* ,-%ª.?/0%?132"%?45.?6
> ,-.7'84:97!;.7'< "!>=?.ª!>'?*�...@b.c4®*
> ACM Press
> New York
> Addison-Wesley
> D)EGFIH J>KMLON8P$QRH ESPUT?V?WYXZE>TR[\PUQ]L_^`E>ababE>cedgfUahX;ijija

The mentioned PDF uses Type 3 fonts for most of the text. This font type
consists of glyphs for every single letter and doesn't have any encoding. In
most cases this kind of text content can't be extracted; even Acrobat Reader
won't do it (try selecting some of the text and copying and pasting it into a
text editor: the text will be scrambled).
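
To illustrate why such text comes out scrambled, here is a minimal toy model in
plain Java. It is only a sketch and not how PDFBox works internally: the class
GlyphCodeDemo, its methods and the glyph codes starting at 0x21 are all made up
for this example. It assumes a font that stores each glyph under an arbitrary
code, so the page content holds codes rather than characters. Reading those
codes as if they were characters gives garbage; the text can only be recovered
with a table that maps codes back to characters, which is the information
missing in the case described above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every name and code value here is hypothetical.
class GlyphCodeDemo {

    // Assign each distinct character an arbitrary glyph code, starting at 0x21.
    // A real font may number its glyphs in any way it likes.
    static Map<Character, Integer> assignCodes(String text) {
        Map<Character, Integer> codes = new HashMap<>();
        int next = 0x21;
        for (char c : text.toCharArray()) {
            if (!codes.containsKey(c)) {
                codes.put(c, next++);
            }
        }
        return codes;
    }

    // What ends up in the page content: glyph codes, not characters.
    static int[] encode(String text, Map<Character, Integer> codes) {
        int[] out = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = codes.get(text.charAt(i));
        }
        return out;
    }

    // Extraction without a code-to-character table: each code is read as if
    // it were a character code, so the result is scrambled.
    static String extractNaively(int[] glyphCodes) {
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    // Extraction with a code-to-character table, built here by reversing
    // the assignment. This table is what the extractor would need.
    static String extractWithTable(int[] glyphCodes, Map<Character, Integer> codes) {
        Map<Integer, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, Integer> e : codes.entrySet()) {
            reverse.put(e.getValue(), e.getKey());
        }
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            sb.append(reverse.get(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "ACM Press";
        Map<Character, Integer> codes = assignCodes(original);
        int[] stream = encode(original, codes);
        System.out.println("without table: " + extractNaively(stream));
        System.out.println("with table:    " + extractWithTable(stream, codes));
    }
}
```

The page still renders correctly in this model, because drawing only needs the
glyph shape stored under each code. Only extraction depends on the table.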

BR
Andreas Lehmkühler
