[
https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Fischer updated PDFBOX-728:
----------------------------------
Attachment: wias_preprints_1401-v1.2.0SNAPSHOT.txt
I have compiled pdfbox-1.2.0-SNAPSHOT.jar and extracted the text from
wias_preprints_1401.pdf.
The result shows a significant improvement over the previous version.
The only major error remaining is that the en dash (code 21 or \x15 in the
original) is mapped to code point 184 (\xB8, a cedilla) instead of 8211
(U+2013): (1.1)¸(1.5) instead of (1.1)–(1.5).
A minor error is
H1 ↪→ Lp instead of H1 ↪ Lp. This is due to the TeX construction used:
arrowhookleft + arrow is supposed to create a left hook and combine it with an
arrow. I don't know whether Unicode has such a construction, so I suggested
using "RIGHTWARDS ARROW WITH HOOK" (U+21AA) for the *combination* of the two
characters.
But I will have to test how these adjustments work on TeX files created in
other contexts.
(BTW, the erroneous ß in "we can somewhat relax the ßmallnessrequirement" is
also contained in the original PDF, so no error on the side of PDFBox.)
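
Until the mapping itself is adjusted, the two remaining errors could be patched
in a post-processing step. The following is only a minimal Python sketch, not
anything PDFBox provides: the name fix_remaining is hypothetical, and it
assumes that the cedilla (U+00B8) never occurs legitimately in the extracted
text and that the hook and the arrow always appear directly next to each other.

```python
# Hypothetical post-processing helper; not part of PDFBox.
# Assumption: U+00B8 only ever appears where an en dash was meant.
REPLACEMENTS = {
    "\u00b8": "\u2013",        # cedilla (184) -> en dash (8211)
    "\u21aa\u2192": "\u21aa",  # hook + arrow pair -> RIGHTWARDS ARROW WITH HOOK
}


def fix_remaining(text: str) -> str:
    """Apply each literal replacement to the extracted text."""
    for wrong, right in REPLACEMENTS.items():
        text = text.replace(wrong, right)
    return text
```

Applied to the two examples above, this would turn (1.1)¸(1.5) into
(1.1)–(1.5) and H1 ↪→ Lp into H1 ↪ Lp. A text that really contains a cedilla
would be damaged by it, so this is a workaround rather than a fix.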
> Text extracted from a TeX-created PDF file comes in some form of hex encoding
> -----------------------------------------------------------------------------
>
> Key: PDFBOX-728
> URL: https://issues.apache.org/jira/browse/PDFBOX-728
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.1.0
> Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText
> -encoding UTF-8
> Reporter: Thomas Fischer
> Priority: Minor
> Fix For: 1.2.0
>
> Attachments: wias_preprints_1401-v1.2.0SNAPSHOT.txt,
> wias_preprints_1401.pdf, wias_preprints_1401.txt,
> wias_preprints_1401_r944875.txt
>
>
> The text in this example is extracted essentially correctly, but presented in
> a hex-encoded form, probably interspersed with some non-encoded characters, as
> in the following example:
> x54x6f x69x6ex63x6fx72x70x6fx72x61x74x65 x74x68x65 x65x6cx61x73x74x69x63
> x70x72x6fx70x65x72x74x69x65x73 x6fx66 x74x68x65 x6dx61x74x65x72x69x61x6cx2c
> x77x65 x6ex65x65x64 x74x6f x69x6ex74x72x6fx64x75x63x65 x74x68x65
> x64x65x66x6fx72x2d
> x6dx61x74x69x6fx6e x74x65x6ex73x6fx72
> F(X, t) = ∂x∂X (X, t).
> A Perl command like
> s/x([\da-f]{2})/chr(hex($1))/eg;
> will usually reveal a correct translation, although certain characters may be
> off; for example, I had to add
> s/ÿ/ß/g;
>
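
For readers without Perl at hand, the two substitutions quoted above can be
sketched in Python. This is a rough port under stated assumptions, not a
definitive tool: the names decode_hex and decode_and_fix are hypothetical, and
like the Perl version it will also decode any genuine text that happens to
look like "x" followed by two lowercase hex digits.

```python
import re

# Same pattern as the Perl one-liner: a literal "x" plus two lowercase hex digits.
HEX_PAIR = re.compile(r"x([0-9a-f]{2})")


def decode_hex(text: str) -> str:
    """Replace every xNN group with the character whose code is 0xNN."""
    return HEX_PAIR.sub(lambda m: chr(int(m.group(1), 16)), text)


def decode_and_fix(text: str) -> str:
    """Decode, then apply the reporter's extra y-diaeresis -> sharp-s fix."""
    return decode_hex(text).replace("\u00ff", "\u00df")
```

Run on the start of the sample, decode_hex("x54x6f x69x6ex63x6fx72x70x6fx72x61x74x65")
gives "To incorporate", while an unencoded line such as "F(X, t)" passes
through unchanged.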