[jira] Commented: (TIKA-584) Tika parse of some PDF files removes all spaces between words

Ken Krugler (JIRA) Sat, 15 Jan 2011 11:18:08 -0800

    [ https://issues.apache.org/jira/browse/TIKA-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982153#action_12982153 ]


Ken Krugler commented on TIKA-584:
----------------------------------

This looks like the same issue as TIKA-583.

If so, please link this issue and close it as a duplicate.

Thanks,

-- Ken

> Tika parse of some PDF files removes all spaces between words
> -------------------------------------------------------------
>
>                 Key: TIKA-584
>                 URL: https://issues.apache.org/jira/browse/TIKA-584
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Windows XP 3, OpenSuse 11.2
>            Reporter: Ajay Vohra
>
> For some PDF files (not all), when the Tika.parse(InputStream) method is 
> used, the content extracted from the returned reader has all spaces 
> removed. One example where this happens is JavaEE6Tutorial.pdf (available 
> from Oracle); many other files show the same bug. I have also tried the 
> Tika 0.9 snapshot, and the bug remains.
> When PDFTextStripper is used directly, the extracted content is correct, 
> with the spaces between words retained.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
