Re: Conversion of TeX-created PDF

Roman Klinger Thu, 29 Oct 2009 09:37:18 -0700

Hi Thomas,

Thomas Fischer wrote:

I am dealing with PDF files that have been created using TeX. This seems to create some specific problems. These are earlier papers from the 1990s; newer ones may be more standardised and present fewer problems.
1. German Umlauts may or may not be recognised.
For "Hölder" I get "Ho¨lder" in one place and "H¨older" in another in the same document. "Ho¨lder" would be correct in UTF-8 if the diaeresis were the combining one (U+0308), but it is the non-combining variety (U+00A8). The same appears in the HTML version (here: ¨). The non-combining character is not a real problem, but putting it before the letter in one place and after it in another is. PDFBox 0.7.3 seems to consistently use the version "H¨older".


These are typical problems of the encoding of the font.

An example:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc} % 1
\usepackage{palatino}
\usepackage{ngerman}
\begin{document}
Here is an ä.
\end{document}

As it is, text extraction gives you "ä".

Comment out the line marked with % 1 and you get "¨a". Without the font encoding specified, the symbol cannot be found in the font, so TeX builds the umlaut by combining two dots and the "a".

I do not think you can really solve that problem in a nice way. I deal with such problems by post-processing the extracted text (e.g. replacing ¨a with ä).
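
A minimal sketch of such a post-processing step in Python, in case it helps. The function name fix_umlauts and the restriction to the German vowels a, o, u are my assumptions for illustration, not part of PDFBox. It handles both orderings mentioned above ("H¨older" and "Ho¨lder"), trying the accent-before-vowel form first; that order is only a heuristic and can guess wrong if a stray ¨ sits between two vowels.

```python
import re
import unicodedata

SPACING_DIAERESIS = "\u00a8"    # non-combining diaeresis found in the extracted text
COMBINING_DIAERESIS = "\u0308"  # combining diaeresis
VOWELS = "aouAOU"               # assumption: only German umlaut base letters


def _compose(match: re.Match) -> str:
    # Attach a combining diaeresis to the captured vowel and normalise to
    # the precomposed character (e.g. "o" + U+0308 becomes U+00F6).
    return unicodedata.normalize("NFC", match.group(1) + COMBINING_DIAERESIS)


def fix_umlauts(text: str) -> str:
    # Case 1: accent extracted before the vowel, as in "H¨older".
    text = re.sub(f"{SPACING_DIAERESIS}([{VOWELS}])", _compose, text)
    # Case 2: accent extracted after the vowel, as in "Ho¨lder".
    text = re.sub(f"([{VOWELS}]){SPACING_DIAERESIS}", _compose, text)
    return text


if __name__ == "__main__":
    print(fix_umlauts("H\u00a8older and Ho\u00a8lder"))
```

You would run this over the string returned by the text extraction before any further processing; other accents (e.g. acute or grave) would need their own entries.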


By the way, this is not a TeX-specific problem.

Best,
 Roman

--
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Department of Bioinformatics
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klin...@scai.fhg.de
http://www.scai.fraunhofer.de/klinger.html
