[ https://issues.apache.org/jira/browse/TIKA-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072483#comment-16072483 ]
Tim Allison commented on TIKA-2403:
-----------------------------------
Yes, thank you. Sorry for the delay. The text you don't want comes from the
PDF's bookmarks. You can turn bookmark extraction off with a tika-config.xml; see:
https://wiki.apache.org/tika/TikaConfig
I'm not sure how to specify the tika-config.xml in Elasticsearch, but I would
hope that it is straightforward.
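
For illustration, a tika-config.xml that disables bookmark extraction might look roughly like the sketch below. It assumes the Tika version bundled with your Elasticsearch release supports the PDFParser parameter extractBookmarksText (that parameter name is not given in this comment, and older releases may not honor it), so please check it against the wiki page above before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Configure only the PDF parser; other parsers keep their defaults -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Assumption: this parameter exists in the Tika version in use -->
        <param name="extractBookmarksText" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
```

With a setting like this in effect, the bookmark (outline) titles should no longer be appended to the extracted body text.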
Let us know if you have any other questions.
> Elasticsearch 5.2.2 - Ingest Node - PDF - Parsing Issue
> -------------------------------------------------------
>
> Key: TIKA-2403
> URL: https://issues.apache.org/jira/browse/TIKA-2403
> Project: Tika
> Issue Type: Bug
> Reporter: Boopathi
> Attachments: SampleDocument.pdf
>
>
> We are using Elasticsearch 5.2.2 for full-text search. With the help of the
> ingest node, we are able to parse the content of files that Tika supports. We
> are facing an issue while parsing the content of some PDF files. The content
> of the file is parsed successfully, but the output also contains some
> additional terms that are not part of the document's content: [sample screen
> shot|https://www.screencast.com/t/AQWK9Rzvrdo8]. Kindly let me know the
> reason for this and how it can be fixed.