[jira] Updated: (PDFBOX-728) Text extracted from a TeX-created PDF file comes in some form of hex encoding

Thomas Fischer (JIRA) Mon, 17 May 2010 05:54:10 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Fischer updated PDFBOX-728:
----------------------------------

    Priority: Minor  (was: Major)

While the major problems appear to be fixed, the attached file 
wias_preprints_1401_r944875.txt still shows some minor mistakes. I hope these 
can be solved on a general level and are not specific to this particular file 
(I found this behaviour in all the hex-encoded files).

The following additional replacements are required (Perl notation). Note that 
the characters to be replaced in the ASCII range 1-31 are usually not printable:
        s/\x15/-/g;     # 21 = x15
        s/\x1b/ff/g;    # 27 = x1b
        s/\x1c/fi/g;    # 28 = x1c
        s/\x1d/fl/g;    # 29 = x1d
        s/\x1e/ffi/g;   # 30 = x1e
        s/\x8a/Ł/g;     # 138 = x8a
        s/\xff/ß/g;     # ÿ = 255 = xff
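
For readers who do not use Perl, the substitutions above can be sketched in 
Python as a single translation table. This is only an illustration, not 
PDFBox code: the name fix_tex_glyphs is hypothetical, and the sketch assumes 
the extracted text has already been decoded so that the problem characters 
appear as the code points U+0015, U+001B and so on.

```python
# Hypothetical helper mirroring the Perl substitutions listed above.
# Assumption: the text is a decoded str in which the problem characters
# appear as the code points 0x15, 0x1b, 0x1c, 0x1d, 0x1e, 0x8a and 0xff.
REPLACEMENTS = {
    "\x15": "-",
    "\x1b": "ff",
    "\x1c": "fi",
    "\x1d": "fl",
    "\x1e": "ffi",
    "\x8a": "\u0141",  # Ł
    "\xff": "\u00df",  # ß
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(REPLACEMENTS)


def fix_tex_glyphs(text: str) -> str:
    """Apply all replacements in one pass over the text."""
    return text.translate(_TABLE)
```

For example, fix_tex_glyphs("di\x1berent") returns "different".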

Furthermore, some of these files use non-standard TeX notation. I don't know 
whether that can be dealt with (the character in quotes is present in the file):
"%"     arrownortheast should probably be \nearrow (TeX):
↗       NORTH EAST ARROW

",→"    arrowhookleft + arrow should probably be \hookrightarrow (TeX):
↪       RIGHTWARDS ARROW WITH HOOK

"*"     arrowrighttophalf should probably be \rightharpoonup (TeX):
⇀       RIGHTWARDS HARPOON WITH BARB UPWARDS
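
Replacing "%" or "*" in the extracted text itself would also hit legitimate 
occurrences of those characters, so a fix would presumably have to work on 
glyph names before they are turned into text. The following Python sketch 
shows that idea under stated assumptions: the glyph names are the ones quoted 
above, while the function glyphs_to_text and its token-list input are 
hypothetical and do not correspond to any PDFBox API.

```python
# Glyph names as quoted in the report; everything else here is hypothetical.
SINGLE_GLYPHS = {
    "arrownortheast": "\N{NORTH EAST ARROW}",
    "arrowrighttophalf": "\N{RIGHTWARDS HARPOON WITH BARB UPWARDS}",
}
HOOK = "arrowhookleft"
RIGHT_ARROW = "\N{RIGHTWARDS ARROW}"


def glyphs_to_text(tokens):
    """Join tokens into text.

    Each token is either a glyph name or a piece of already decoded text.
    A hook glyph directly followed by a right arrow becomes one character.
    """
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == HOOK and i + 1 < len(tokens) and tokens[i + 1] == RIGHT_ARROW:
            out.append("\N{RIGHTWARDS ARROW WITH HOOK}")
            i += 2  # consume both the hook and the arrow
            continue
        # Unknown tokens are passed through unchanged.
        out.append(SINGLE_GLYPHS.get(tok, tok))
        i += 1
    return "".join(out)
```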

And finally:
f prime should be extracted as f′ rather than f ′ (no space before the prime)
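
As a post-processing workaround, the stray space could be removed with a 
regular expression. This is a minimal Python sketch, not a fix in the 
extractor; the name attach_primes is hypothetical, and it assumes that a 
space or tab directly before a prime character is never intended.

```python
import re

# Spaces or tabs directly before a prime, double prime or triple prime
# (U+2032 to U+2034). Line breaks are deliberately left alone.
_SPACE_BEFORE_PRIME = re.compile(r"[ \t]+(?=[\u2032\u2033\u2034])")


def attach_primes(text: str) -> str:
    """Remove horizontal whitespace that precedes a prime character."""
    return _SPACE_BEFORE_PRIME.sub("", text)
```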

> Text extracted from a TeX-created PDF file comes in some form of hex encoding
> -----------------------------------------------------------------------------
>
>                 Key: PDFBOX-728
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-728
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText 
> -encoding UTF-8
>            Reporter: Thomas Fischer
>            Priority: Minor
>             Fix For: 1.2.0
>
>         Attachments: wias_preprints_1401.pdf, wias_preprints_1401.txt, 
> wias_preprints_1401_r944875.txt
>
>
> The text in this example is extracted essentially correctly, but presented in 
> a hex-encoded form, probably interspersed with some non-encoded characters, as 
> in the following example:
> x54x6f x69x6ex63x6fx72x70x6fx72x61x74x65 x74x68x65 x65x6cx61x73x74x69x63 
> x70x72x6fx70x65x72x74x69x65x73 x6fx66 x74x68x65 x6dx61x74x65x72x69x61x6cx2c 
> x77x65 x6ex65x65x64 x74x6f x69x6ex74x72x6fx64x75x63x65 x74x68x65 
> x64x65x66x6fx72x2d
> x6dx61x74x69x6fx6e x74x65x6ex73x6fx72
> F(X, t) = ∂x∂X (X, t).
> A Perl command like
> s/x([\da-f]{2})/chr(hex($1))/eg;
> will usually reveal a correct translation, although certain characters may be 
> off; e.g. I had to add
> s/ÿ/ß/g;
>  
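
The hex decoding quoted above can also be sketched in Python. This is an 
illustrative equivalent of the Perl one-liner, not part of PDFBox; the name 
decode_hex_runs is hypothetical. Like the Perl version, it assumes that an 
"x" followed by two lowercase hex digits never occurs in text that was 
extracted correctly, which does not hold in general (the "xce" in "exceed" 
would be decoded too).

```python
import re

# An "x" followed by exactly two lowercase hex digits, as in the Perl regex.
_HEX_CODE = re.compile(r"x([0-9a-f]{2})")


def decode_hex_runs(text: str) -> str:
    """Replace every xNN code by the character with code point 0xNN."""
    decoded = _HEX_CODE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Extra fix-up from the report: ÿ (0xff) stands for ß in these files.
    return decoded.replace("\u00ff", "\u00df")
```

Applied to the start of the sample, decode_hex_runs("x54x6f 
x69x6ex63x6fx72x70x6fx72x61x74x65 x74x68x65") returns "To incorporate the".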

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
