[ https://issues.apache.org/jira/browse/PDFBOX-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920168#action_12920168 ]
Saurabh Mehrotra commented on PDFBOX-860:
-----------------------------------------

Hi,

It will not be possible to attach a sample PDF file. I will try to create a PDF file with those characters and upload the results.

However, a useful finding which might help you is that the output is fine when using version 0.8.0 of PDFBox, but when we use version 1.2.1 the characters 'fi' get converted to '?'.

I will check whether ligature characters are used in the PDF files.

Thanks & Regards
Saurabh

> 'fi' getting converted to '?'
> -----------------------------
>
>                 Key: PDFBOX-860
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-860
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Solaris 10
>            Reporter: Saurabh Mehrotra
>
> Hi
> I am trying to use PDFBox version 1.2.1 to extract text from PDF files.
> The following issue is observed in the extracted text:
> 1. The character combination 'fi' is converted to a '?'
> example: first becomes ?rst
> classifier becomes classi?er
> find becomes ?nd
> Is this a known bug? Can some setting of PDFBox be turned off to prevent
> this?
> Thanks & Regards
> Saurabh

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
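
The symptom in the report above (first becoming ?rst) would be consistent with the extracted text containing the single ligature code point U+FB01 and the output then being written in an encoding that cannot represent it. This is only an assumption; nothing in this thread confirms it. The sketch below uses plain JDK classes, not any PDFBox API, to check already-extracted text for that code point, to show how it turns into '?' under ISO-8859-1, and to expand it back to "fi". The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: it only inspects text
// that has already been extracted into a Java String.
public class LigatureCheck {

    // U+FB01 is the Unicode code point for the single-glyph "fi" ligature.
    private static final char FI_LIGATURE = '\uFB01';

    // Returns true if the text contains the "fi" ligature code point.
    public static boolean containsFiLigature(String text) {
        return text.indexOf(FI_LIGATURE) >= 0;
    }

    // NFKC normalization expands compatibility characters such as U+FB01 to "fi".
    // Note that it also rewrites other compatibility characters (e.g. superscripts).
    public static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Simulates writing the text in ISO-8859-1 and reading it back.
    // Characters the charset cannot represent are replaced with '?'.
    public static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String extracted = "classi" + FI_LIGATURE + "er";
        System.out.println(containsFiLigature(extracted)); // prints true
        System.out.println(roundTripLatin1(extracted));    // prints classi?er
        System.out.println(expandLigatures(extracted));    // prints classifier
    }
}
```

If containsFiLigature returns true for text extracted with 1.2.1, then writing the output as UTF-8, or normalizing it before writing, may avoid the '?' substitution. Whether that applies to the files in this report is untested.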