[full quote to get back to BTS]

On Thu, Nov 03, 2005 at 12:50 +0100, Frank Küster wrote:
> Ralf Stubner <[EMAIL PROTECTED]> wrote:
> 
> > Text-extraction from PDF is really complicated. If one adds a few
> > interesting things (fi, ä, ß) to Frank's test file, one finds with
> > pdftotext (best used via 'less <pdf-file>') that 'fi' is not found at
> > all, 'ä' is found, 'ß' is found as 'ÿ', even when processed with
> > pdflatex. IIRC there is some stage in the text-extraction where some
> > default encoding (Latin-1 or something similar) is used. pdflatex
> > probably includes the Type3 font with an encoding equivalent to T1. Now
> > the code position of 'fi' in T1 is not defined in Latin-1, the code
> > position of 'ß' in T1 is 'ÿ' in Latin-1, the code position of 'ä' is the
> > same in both. So this fits. I guess that ghostscript changes the
> > encoding of the Type3 font when creating the PDF, which makes text
> > extraction rather meaningless. If one uses Type1 fonts, ghostscript is
> > probably able to use a sensible encoding based on the glyphnames in the
> > font. 
> 
> That sounds all very sensible, *but*:  On dctt where this first came up
> (Thread started by "Nils"),  several people said that they could use the
> find function on pdf files - I assume they read the question properly
> and used latex/dvips/ps2pdf.
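
To make the code-position argument quoted above concrete: in the
T1/Cork encoding the 'fi' ligature sits at slot 0x1C, 'ä' at 0xE4, and
'ß' at 0xFF, while Latin-1 puts 'ä' at 0xE4, 'ÿ' at 0xFF, and only a
non-printable control character at 0x1C. A minimal sketch (assuming
Python 3) that re-reads those T1 byte values as Latin-1, the way a
naive text extractor would:

    # T1/Cork slots for the three test glyphs.
    t1_slots = {"fi": 0x1C, "ä": 0xE4, "ß": 0xFF}

    for glyph, slot in t1_slots.items():
        # Re-interpret the T1 byte as Latin-1, as a naive extractor does.
        as_latin1 = bytes([slot]).decode("latin-1")
        shown = as_latin1 if as_latin1.isprintable() else "<non-printable>"
        print(f"{glyph!r}: T1 slot 0x{slot:02X} -> Latin-1 {shown!r}")

This prints 'ä' unchanged, 'ÿ' for the 'ß' slot, and nothing printable
for the 'fi' slot, which matches the behaviour described above.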

I assume that those people have cm-super installed. If I enable
cm-super on my system, text extraction works fine even for 'fi' and 'ß'.
Even if AR 7 is finally able to display bitmap fonts decently, there are
still good reasons to use Type1 fonts.
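
A quick way to check this on a given PDF is to run pdftotext over it
and look for the problematic strings. A minimal sketch, assuming
Python 3, a test file named test.pdf (just a placeholder name), and
pdftotext from xpdf/poppler on the PATH:

    import subprocess

    # 'pdftotext FILE -' writes the extracted text to stdout.
    out = subprocess.run(["pdftotext", "test.pdf", "-"],
                         capture_output=True, text=True,
                         check=True).stdout

    for needle in ("fi", "ä", "ß"):
        print(f"{needle!r}: {'found' if needle in out else 'NOT found'}")

With the bitmap (Type3) fonts the 'fi' and 'ß' checks fail as described
above; with cm-super enabled the extraction should succeed.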

cheerio
ralf
 
