[ https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326146#comment-14326146 ]
Tim Allison commented on TIKA-1552:
-----------------------------------
And Adobe Reader's "Save as Text" puts every word on its own line:
{noformat}
•
Provides
$17.7
billion
in
discretionary
funding
for
the
National
Aeronautics
and
Space
Administration
(NASA),
a
decrease
of
0.3
percent,
or
about
$50
million,
below
the
2012
enacted
level.
While
making
tough
choices,
{noformat}
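Until the extraction itself is sorted out, a caller can collapse the stray tabs or line breaks after the fact. The following is only a minimal sketch of that workaround, not a fix in Tika or PDFBox: the class name WhitespaceNormalizer and the method collapse are made up for illustration, and it uses nothing but the JDK. It assumes the caller does not need tabs or line breaks preserved.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): post-processes already extracted text.
final class WhitespaceNormalizer {

    // Matches any run of whitespace: tabs, newlines, spaces, and so on.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces every whitespace run with one space and trims both ends.
    static String collapse(String extracted) {
        return WHITESPACE_RUN.matcher(extracted).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Tab-separated words, as in the parsed text reported in this issue.
        String tabbed = "Provides\t$17.7\tbillion\tin\tdiscretionary\tfunding";
        // prints: Provides $17.7 billion in discretionary funding
        System.out.println(collapse(tabbed));
    }
}
```

Note that this also flattens genuine paragraph breaks and table columns, so it only suits cases where a single running line of text is acceptable.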
> Pdf document parser
> -------------------
>
> Key: TIKA-1552
> URL: https://issues.apache.org/jira/browse/TIKA-1552
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.7
> Reporter: Konstantin
> Attachments: 2014_US_Federal_Budget.pdf, issue.jpg
>
>
> Hello,
> We found that when a PDF document has marked text inside a frame (table),
> Tika inserts tabs between the words after parsing.
> Original text from the attached file:
> Provides $17.7 billion in discretionary funding for the National Aeronautics and Space
> Parsed text (JIRA removed the tabs, so I will add -> symbols instead):
> • Provides -> $17.7 -> billion->in->discretionary->funding->for->the->National->Aeronautics->and->Space
> Please take a look at the attached screenshot.
> The parsed text, shown in a text editor, is on the left side.
> Thank you.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)