[jira] [Updated] (PDFBOX-1127) PDF supplies glyph->unicode mapping, but PDFBox doesn't use it.

Robert Muir (Updated) (JIRA) Sun, 02 Oct 2011 13:10:59 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated PDFBOX-1127:
--------------------------------

    Attachment: encoding.jpg

Screenshot showing the glyph list in FontForge next to the mapping from the 
PDF file, and how the two correspond.
                
> PDF supplies glyph->unicode mapping, but PDFBox doesn't use it.
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-1127
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1127
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.7.0
>         Environment: Tested trunk r1177011
>            Reporter: Robert Muir
>         Attachments: ebrat.pdf, encoding.jpg
>
>
> A user reported this PDF on the Lucene lists: 
> http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files
> I asked them to create a TIKA issue (TIKA-713) and attach the PDF file.
> Upon inspection, the fonts used in the PDF have custom encodings (that map 
> the characters to U+0001, U+0002, ...). However, they also contain a 
> mapping from the font's glyphs to Unicode 
> >>/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences, but PDFBox doesn't 
> use this mapping. If you use ExtractText, it extracts the raw control 
> characters instead.
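
To illustrate the kind of mapping the report describes, here is a minimal sketch of how a /Differences array can be expanded into a code-to-glyph-name table and then resolved to Unicode. It is not PDFBox's API or its actual code path: the class DifferencesSketch and its methods are hypothetical names. The glyph names are also an assumption, since the issue text does not say which names ebrat.pdf uses. The sketch assumes names of the form uniXXXX and handles nothing else.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of PDFBox.
final class DifferencesSketch {

    /**
     * Expands a /Differences array into a code -> glyph-name table.
     * An integer entry sets the current code; each following name is
     * assigned to the current code, which then increases by one.
     */
    static Map<Integer, String> expand(List<Object> differences) {
        Map<Integer, String> codeToName = new LinkedHashMap<>();
        int code = -1;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else if (code >= 0) {
                codeToName.put(code++, (String) item);
            }
        }
        return codeToName;
    }

    /**
     * Resolves a glyph name of the assumed form "uniXXXX" (four uppercase
     * hex digits) to a string. Returns null for any other name.
     */
    static String toUnicode(String glyphName) {
        if (glyphName != null && glyphName.matches("uni[0-9A-F]{4}")) {
            int codePoint = Integer.parseInt(glyphName.substring(3), 16);
            return new String(Character.toChars(codePoint));
        }
        return null;
    }

    /** Decodes single-byte character codes through the table. */
    static String decode(byte[] codes, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            String text = toUnicode(table.get(b & 0xFF));
            // U+FFFD marks codes that could not be resolved.
            sb.append(text != null ? text : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed example data: codes 1..4 mapped to Arabic-script glyph names.
        List<Object> differences =
                Arrays.asList(1, "uni0633", "uni0644", "uni0627", "uni0645");
        Map<Integer, String> table = expand(differences);
        System.out.println(decode(new byte[] {1, 2, 3, 4}, table));
    }
}
```

With a table like this, the bytes 0x01, 0x02, ... that currently come out of ExtractText as control characters would instead be looked up by glyph name. A real implementation would also need to consult a glyph list for names that do not follow the uniXXXX pattern.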

