[ https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin updated TIKA-1552:
-----------------------------
    Description: 
Hello,
We found that when a PDF document has marked text inside a frame (table),
Tika inserts tabs between words after parsing.
Original text:
Provides $17.7 billion in discretionary funding for the National Aeronautics 
and Space

Parsed text (JIRA removed the tabs, so I will use -> symbols instead):
•        Provides -> 
$17.7->billion->in->discretionary->funding->for->the->National->Aeronautics->and->Space

Thank you.

  was:
Hello,
We found that when a PDF document has marked text inside a frame (table),
Tika inserts tabs between words after parsing.
Original text:
Provides $17.7 billion in discretionary funding for the National Aeronautics 
and Space

Parsed text (JIRA removed the tabs, so I will use -> symbols instead):
•        Provides       $17.7   billion in      discretionary   funding for     
the     National        Aeronautics     and     Space

Thank you.


> Pdf document parser
> -------------------
>
>                 Key: TIKA-1552
>                 URL: https://issues.apache.org/jira/browse/TIKA-1552
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Konstantin
>
> Hello,
> We found that when a PDF document has marked text inside a frame (table),
> Tika inserts tabs between words after parsing.
> Original text:
> Provides $17.7 billion in discretionary funding for the National Aeronautics 
> and Space
> Parsed text (JIRA removed the tabs, so I will use -> symbols instead):
> •        Provides -> 
> $17.7->billion->in->discretionary->funding->for->the->National->Aeronautics->and->Space
> Thank you.
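
For anyone trying to confirm the symptom, here is a minimal sketch, not part of the original report, that makes tab characters visible in extracted text the same way the reporter marked them with `->`. The helper names `show_tabs` and `count_tabs_between_words` and the sample string are hypothetical; in practice the input would be whatever text Tika returns for the affected PDF.

```python
import re


def show_tabs(text: str, marker: str = "->") -> str:
    """Replace every tab character with a visible marker."""
    return text.replace("\t", marker)


def count_tabs_between_words(text: str) -> int:
    """Count tabs that sit directly between two non-whitespace characters."""
    return len(re.findall(r"(?<=\S)\t(?=\S)", text))


# Hypothetical sample imitating the parsed output described in the report.
sample = "$17.7\tbillion\tin\tdiscretionary\tfunding"

print(show_tabs(sample))
print(count_tabs_between_words(sample))
```

A non-zero count on text that contains only single spaces in the source PDF would be consistent with the behaviour described in this issue.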



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
