Hi Thomas,
Thomas Fischer wrote:
I am dealing with PDF files that have been created using TeX. This
seems to create some specific problems.
These are earlier papers from the 1990s; newer ones may be more
standardised and present fewer problems.
1. German Umlauts may or may not be recognised.
For "Hölder" I get "Ho¨lder" in one place and "H¨older" in another
within the same document. "Ho¨lder" would be correct in UTF-8 if the
diaeresis were the combining one (U+0308), but it is the non-combining
variety (U+00A8). The same appears in the HTML version (here: ¨). The
non-combining character is not a real problem, but placing it before
the vowel in one case and after it in the other is. PDFBox 0.7.3 seems
to use the form "H¨older" consistently.
These are typical font-encoding problems.
An example:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc} % 1
\usepackage{palatino}
\usepackage{ngerman}
\begin{document}
Here is an ä.
\end{document}
As it is, the extracted text contains "ä".
Comment out the line marked with % 1 and you get "¨a". Without the T1
font encoding (LaTeX then falls back to its default, OT1), the font has
no single "ä" glyph, so TeX builds the umlaut by combining the two dots
and the "a".
I do not think you can really solve that problem in a nice way. I deal
with such problems by post-processing the extracted text (e.g.
replacing ¨a with ä).
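A minimal sketch of such a post-processing step in Python, in case it
helps. It assumes the stray character is always the non-combining
diaeresis (U+00A8) and that only the German vowels a, o, u are affected;
the function name repair_umlauts is made up for illustration and is not
part of any extraction library:

```python
import re
import unicodedata

DIAERESIS = "\u00a8"            # non-combining diaeresis, as found in the extracted text
COMBINING_DIAERESIS = "\u0308"  # combining diaeresis, used to compose the umlaut

# Assumption: only a, o, u (either case) take the diaeresis.
# Matches the diaeresis either directly before or directly after the vowel.
_PATTERN = re.compile(f"{DIAERESIS}([aouAOU])|([aouAOU]){DIAERESIS}")


def repair_umlauts(text: str) -> str:
    """Replace 'diaeresis + vowel' or 'vowel + diaeresis' with the precomposed umlaut."""

    def _compose(match: re.Match) -> str:
        base = match.group(1) or match.group(2)
        # NFC turns vowel + combining diaeresis into one precomposed character.
        return unicodedata.normalize("NFC", base + COMBINING_DIAERESIS)

    return _PATTERN.sub(_compose, text)


if __name__ == "__main__":
    for sample in ("H\u00a8older", "Ho\u00a8lder", "Here is an \u00a8a."):
        print(repair_umlauts(sample))
```

Because the diaeresis can end up on either side of the vowel, a
sequence such as a¨o is ambiguous; the sketch simply takes the leftmost
match, so results should be checked against a word list if that
matters.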
By the way, this is not a TeX-specific problem.
Best,
Roman
--
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Department of Bioinformatics
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klin...@scai.fhg.de
http://www.scai.fraunhofer.de/klinger.html