[ https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326129#comment-14326129 ]
Tim Allison commented on TIKA-1552:
-----------------------------------
Thank you for raising this. I just tried pdfbox-app-1.8.8.jar's ExtractText on
the supplied document, and the tabs show up there too, so they appear to come
from PDFBox's text extraction rather than from Tika itself. I'm not sure that
this is something we can control at the Tika level.
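
If the tabs cannot be suppressed upstream, one possible caller-side workaround
is to post-process the extracted text. The following is only a minimal sketch
in Python, assuming the text has already been extracted to a string by whatever
means; `normalize_tabs` is a hypothetical helper, not part of Tika or PDFBox.

```python
import re

# Hypothetical helper (not a Tika or PDFBox API): collapse each run of tabs,
# together with any spaces around it, into a single space.
_TAB_RUN = re.compile(r" *\t[ \t]*")


def normalize_tabs(extracted: str) -> str:
    """Return the extracted text with tab runs replaced by single spaces."""
    return _TAB_RUN.sub(" ", extracted)


if __name__ == "__main__":
    parsed = "Provides\t$17.7\tbillion\tin\tdiscretionary\tfunding"
    print(normalize_tabs(parsed))
    # prints: Provides $17.7 billion in discretionary funding
```

Note that this also flattens tabs that legitimately separate table columns, so
it is only appropriate when column structure does not need to be preserved.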
> Pdf document parser
> -------------------
>
> Key: TIKA-1552
> URL: https://issues.apache.org/jira/browse/TIKA-1552
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.7
> Reporter: Konstantin
> Attachments: 2014_US_Federal_Budget.pdf, issue.jpg
>
>
> Hello,
> We found that when a PDF document has marked text inside a frame (table),
> Tika inserts tabs between words after parsing.
> Original text from attached file:
> Provides $17.7 billion in discretionary funding for the National Aeronautics
> and Space
> Parsed text (JIRA removed the tabs, so I have added -> symbols in their place):
> • Provides -> $17.7 ->
> billion->in->discretionary->funding->for->the->National->Aeronautics->and->Space
> Please take a look at the attached screenshot.
> Thank you.