Text extracted from a TeX-created PDF file comes in some form of hex encoding
-----------------------------------------------------------------------------

                 Key: PDFBOX-728
                 URL: https://issues.apache.org/jira/browse/PDFBOX-728
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.1.0
         Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText 
-encoding UTF-8
            Reporter: Thomas Fischer


The text in this example is extracted essentially correctly, but presented in a 
hex-encoded form, probably interspersed with some non encoded characters as in 
the following example:

x54x6f x69x6ex63x6fx72x70x6fx72x61x74x65 x74x68x65 x65x6cx61x73x74x69x63 
x70x72x6fx70x65x72x74x69x65x73 x6fx66 x74x68x65 x6dx61x74x65x72x69x61x6cx2c 
x77x65 x6ex65x65x64 x74x6f x69x6ex74x72x6fx64x75x63x65 x74x68x65 
x64x65x66x6fx72x2d
x6dx61x74x69x6fx6e x74x65x6ex73x6fx72
F(X, t) = ∂x∂X (X, t).
A Perl command like
s/x([\da-f]{2})/chr(hex($1))/eg;
will usually reveal a correct translation, although certain characters may be 
off, I had to add e.g.
s/ÿ/ß/g;
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to