PDF supplies glyph->unicode mapping, but PDFBox doesn't use it.
---------------------------------------------------------------

                 Key: PDFBOX-1127
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1127
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.7.0
         Environment: Tested trunk r1177011
            Reporter: Robert Muir


We had a user report this PDF to the Lucene lists:
http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files

I asked them to create a TIKA issue (TIKA-713) and attach the PDF file.

Upon inspection, the fonts used in the PDF have custom encodings (that map the 
characters to U+0001, U+0002, ...). However, they also contain a mapping from 
the font to Unicode (>>/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences), 
but PDFBox doesn't use this mapping. If you use ExtractText, it extracts the 
raw control characters instead of the mapped text.
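
As a rough illustration of the lookup I would expect, here is a minimal 
standalone sketch. It is not PDFBox code and does not use the PDFBox API. The 
class and method names are hypothetical, and the glyph names in the sample 
array are an assumption: I use "uniXXXX"-style names (the Adobe Glyph List 
convention for naming a glyph by its code point), which may not be what the 
attached PDF actually contains.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: resolve character codes through a /Differences array.
public class DifferencesSketch {

    // A /Differences array is a code followed by the glyph names assigned to
    // that code and the codes after it; a new integer starts a new run.
    public static Map<Integer, String> expandDifferences(List<Object> differences) {
        Map<Integer, String> codeToName = new LinkedHashMap<>();
        int code = 0;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;                  // start of a new run
            } else {
                codeToName.put(code++, (String) item);  // name for current code
            }
        }
        return codeToName;
    }

    // Handles only "uni" + exactly four uppercase hex digits. Returns null for
    // any other name; a fuller resolver would also consult a glyph list.
    public static String glyphNameToUnicode(String name) {
        if (name != null && name.matches("uni[0-9A-F]{4}")) {
            return String.valueOf((char) Integer.parseInt(name.substring(3), 16));
        }
        return null;
    }

    // Decodes codes via the map; unmapped codes fall back to the raw code,
    // which is the behaviour that yields control characters.
    public static String decode(int[] codes, Map<Integer, String> codeToName) {
        StringBuilder out = new StringBuilder();
        for (int c : codes) {
            String unicode = glyphNameToUnicode(codeToName.get(c));
            out.append(unicode != null ? unicode : String.valueOf((char) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: codes 1..4 renamed to four Persian letters.
        List<Object> differences =
                Arrays.<Object>asList(1, "uni0633", "uni0644", "uni0627", "uni0645");
        Map<Integer, String> map = expandDifferences(differences);
        // Prints U+0633 U+0644 U+0627 U+0645 rather than U+0001..U+0004.
        System.out.println(decode(new int[] {1, 2, 3, 4}, map));
    }
}
```

Without the lookup, the same four codes come out as U+0001 through U+0004, 
which matches what ExtractText produces for this file.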

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
