[
https://issues.apache.org/jira/browse/NIFI-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063633#comment-18063633
]
Tilman Hausherr commented on NIFI-10218:
----------------------------------------
Developer Dan asked about this on the Tika mailing list. Your project uses Tika,
which uses PDFBox, and PDFBox only extracts what is available. Glyph (visual
representation) and character (code) are two different things, and sometimes
(like here) the Unicode is incorrect or unavailable. You can try to copy &
paste in Adobe Reader and you'll also get a bad result for that page. Same with
Firefox.
See also the PDFBox FAQ:
https://pdfbox.apache.org/3.0/faq.html#text-extraction
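
If it helps to flag affected documents downstream, here is a minimal sketch of
a heuristic check on already-extracted text. It is not part of PDFBox, Tika or
NiFi; the class and method names are hypothetical. It only counts code points
that commonly show up when a font has no usable Unicode mapping (U+FFFD,
private-use, unassigned, or stray control characters). It cannot detect a
mapping that is wrong but still yields an ordinary character, such as a "+"
that comes out as "#" or "?".

```java
public class SuspiciousTextCheck {

    /** Heuristic: true for code points that often signal a missing or bad Unicode mapping. */
    static boolean isSuspicious(int cp) {
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        if (type == Character.PRIVATE_USE || type == Character.UNASSIGNED) {
            return true; // font-specific or undefined code points
        }
        // Control characters other than common whitespace
        return type == Character.CONTROL && cp != '\n' && cp != '\r' && cp != '\t';
    }

    /** Counts suspicious code points in text that was already extracted from a PDF. */
    static long countSuspicious(String text) {
        return text.codePoints().filter(SuspiciousTextCheck::isSuspicious).count();
    }

    public static void main(String[] args) {
        // Made-up sample: one private-use code point and one replacement character.
        String sample = "x \uF02B y \uFFFD z";
        System.out.println("suspicious code points: " + countSuspicious(sample));
    }
}
```

A non-zero count would only be a hint that the page needs manual review or
OCR; a count of zero does not prove the extracted text is correct.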
> ExtractDocumentText processor does not handle certain characters when
> extracting from a PDF
> -------------------------------------------------------------------------------------------
>
> Key: NIFI-10218
> URL: https://issues.apache.org/jira/browse/NIFI-10218
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Reporter: Andrew M. Lim
> Priority: Minor
> Attachments: 625006.pdf, 625006_results.png, PDF_flow.json,
> example.pdf, example_results.png
>
>
> When a PDF contains special characters ("+", "=", ">", "+-"), these characters
> show up as different symbols when the text is extracted from the document.
> I've attached two PDFs that illustrate the issue differently:
> * 625006.pdf has multiple pages. When the text is extracted from a table,
> certain characters show up as a ? symbol.
> * example.pdf is a single page with the same table. When the text is
> extracted the same characters show up as " or # symbols.