[ https://issues.apache.org/jira/browse/PDFBOX-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated PDFBOX-1127:
--------------------------------

    Attachment: encoding.jpg

Screenshot showing the glyph list in FontForge alongside the mapping from the PDF file, and how the two correspond.

> PDF supplies glyph->unicode mapping, but PDFBox doesn't use it.
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-1127
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1127
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.7.0
>         Environment: Tested trunk r1177011
>            Reporter: Robert Muir
>         Attachments: ebrat.pdf, encoding.jpg
>
>
> We had a user report this PDF to the Lucene lists:
> http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files
> I asked them to create a TIKA issue (TIKA-713) and attach the PDF file.
> Upon inspection, the fonts used in the PDF have custom encodings (that map
> the characters to U+0001, U+0002, ...). However, they also contain a mapping
> from the font to Unicode
> >>/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences, but PDFBox doesn't
> use this mapping. If you use ExtractText, it extracts the raw control
> characters instead.
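
To illustrate the kind of lookup the report is asking for, here is a minimal, self-contained Java sketch. It is not PDFBox code and not a proposed patch. It assumes the /Differences array has already been tokenized into strings, that a number sets the current character code and each following /name takes the next consecutive code, and that glyph names either appear in a small lookup table or follow the "uniXXXX" naming convention. The class name, the method names, and the sample glyph names are all hypothetical; the glyph names actually used in ebrat.pdf are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox API: resolve character codes to Unicode
// through an encoding's /Differences array.
final class DifferencesSketch {

    // Tiny stand-in for a real glyph list; these entries are illustrative only.
    private static final Map<String, String> GLYPH_LIST = new HashMap<>();
    static {
        GLYPH_LIST.put("space", " ");
        GLYPH_LIST.put("A", "A");
    }

    // A number token sets the current code; each /name token is assigned
    // the current code, which then advances by one.
    public static Map<Integer, String> parseDifferences(String[] tokens) {
        Map<Integer, String> codeToName = new HashMap<>();
        int code = 0;
        for (String token : tokens) {
            if (token.startsWith("/")) {
                codeToName.put(code++, token.substring(1));
            } else {
                code = Integer.parseInt(token);
            }
        }
        return codeToName;
    }

    // Looks the name up in the table, then falls back to the "uniXXXX" form
    // (four uppercase hex digits). Returns null if the name is unknown.
    public static String glyphNameToUnicode(String name) {
        String known = GLYPH_LIST.get(name);
        if (known != null) {
            return known;
        }
        if (name.matches("uni[0-9A-F]{4}")) {
            int codePoint = Integer.parseInt(name.substring(3), 16);
            return new String(Character.toChars(codePoint));
        }
        return null;
    }

    // Decodes single-byte codes; codes without a usable mapping are passed
    // through unchanged, which is the behaviour the report describes.
    public static String decode(byte[] codes, Map<Integer, String> differences) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            String name = differences.get(code);
            String text = (name == null) ? null : glyphNameToUnicode(name);
            out.append(text != null ? text : String.valueOf((char) code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical Differences content: codes 1 and 2 get Arabic letters.
        String[] tokens = {"1", "/uni0627", "/uni0628", "32", "/space"};
        Map<Integer, String> differences = parseDifferences(tokens);
        System.out.println(decode(new byte[] {1, 2, 32}, differences));
    }
}
```

With a table like this, codes 1 and 2 decode to the letters named by the glyphs rather than to U+0001 and U+0002. A real fix would presumably need a complete glyph list and would have to fit into PDFBox's existing font and encoding classes, which this sketch does not attempt.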