[full quote to get back to BTS]

On Thu, Nov 03, 2005 at 12:50 +0100, Frank Küster wrote:
> Ralf Stubner <[EMAIL PROTECTED]> wrote:
>
> > Text extraction from PDF is really complicated. If one adds a few
> > interesting things (fi, ä, ß) to Frank's test file, one finds with
> > pdftotext (best used via 'less <pdf-file>') that 'fi' is not found at
> > all, 'ä' is found, and 'ß' is found as 'ÿ', even when the file is
> > processed with pdflatex. IIRC there is some stage in the text
> > extraction where some default encoding (Latin-1 or something similar)
> > is used. pdflatex probably includes the Type3 font with an encoding
> > equivalent to T1. Now the code position of 'fi' in T1 is not defined
> > in Latin-1, the code position of 'ß' in T1 is 'ÿ' in Latin-1, and the
> > code position of 'ä' is the same in both. So this fits. I guess that
> > ghostscript changes the encoding of the Type3 font when creating the
> > PDF, which makes text extraction rather meaningless. If one uses
> > Type1 fonts, ghostscript is probably able to use a sensible encoding
> > based on the glyph names in the font.
>
> That sounds all very sensible, *but*: On dctt, where this first came
> up (thread started by "Nils"), several people said that they could use
> the find function on PDF files - I assume they read the question
> properly and used latex/dvips/ps2pdf.
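For anyone who wants to reproduce this: I don't have Frank's exact
file, so the following is only my guess at a minimal equivalent with
the interesting glyphs added.

    % test.tex -- minimal input exercising fi, ä, ß
    \documentclass{article}
    \usepackage[T1]{fontenc}
    \begin{document}
    fi \"a \ss
    \end{document}

    $ pdflatex test.tex
    $ pdftotext test.pdf -    # or simply: less test.pdf

Without cm-super the PDF contains Type3 bitmap fonts, and the extracted
text shows exactly the mismatches described above: 'fi' is lost, 'ß'
comes out as 'ÿ', and only 'ä' survives.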
I assume that those people have cm-super installed. If I enable
cm-super on my system, text extraction works fine even for 'fi' and
'ß'. So even if AR 7 is finally able to display bitmap fonts decently,
there are still good reasons to use Type1 fonts.

cheerio
ralf
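PS: A quick way to check which kind of font actually ends up in a PDF
is pdffonts from the xpdf-utils package; on Debian, installing the
cm-super package should be all that is needed to switch to the Type1
outlines.

    $ pdflatex test.tex
    $ pdffonts test.pdf
    # 'Type 3' in the type column means bitmap fonts were embedded,
    # 'Type 1' means outline fonts (e.g. from cm-super) were used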