[jira] Commented: (TIKA-584) Tika parse of some PDF files removes all spaces between words

Ajay Vohra (JIRA) Sun, 16 Jan 2011 19:40:13 -0800

    [ https://issues.apache.org/jira/browse/TIKA-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982464#action_12982464 ]


Ajay Vohra commented on TIKA-584:
---------------------------------

I tried the PDF files attached to TIKA-583, and Tika.parse(InputStream) 
parses them with the spaces intact. So it does not appear that TIKA-583 and 
TIKA-584 are the same issue.
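
A rough way to check a file for this symptom is to measure how much of the 
text read from the returned Reader is whitespace. The following is only a 
minimal sketch under stated assumptions: the class name SpaceCheck, its 
method names, and the 5% threshold are hypothetical and chosen for 
illustration, and a StringReader stands in for the Reader that 
Tika.parse(InputStream) would return, so the sketch runs without Tika on 
the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SpaceCheck {
    // Hypothetical threshold, chosen only for illustration: text whose
    // whitespace fraction falls below it is treated as having lost its spaces.
    static final double MIN_RATIO = 0.05;

    // Returns the fraction of characters read from the reader that are
    // whitespace, or 0.0 if the reader is empty.
    static double whitespaceRatio(Reader reader) {
        long total = 0;
        long spaces = 0;
        try {
            int c;
            while ((c = reader.read()) != -1) {
                total++;
                if (Character.isWhitespace(c)) {
                    spaces++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total == 0 ? 0.0 : (double) spaces / total;
    }

    // True if the extracted text contains at least MIN_RATIO whitespace.
    static boolean spacesLookIntact(Reader reader) {
        return whitespaceRatio(reader) >= MIN_RATIO;
    }

    public static void main(String[] args) {
        // In real use the reader would be the one returned by
        // Tika.parse(InputStream); StringReader is a stand-in here.
        System.out.println(spacesLookIntact(new StringReader("spaces between words"))); // true
        System.out.println(spacesLookIntact(new StringReader("spacesbetweenwords")));   // false
    }
}
```

A check like this only flags the symptom described in this issue; it says 
nothing about which component drops the spaces.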

> Tika parse of some PDF files removes all spaces between words
> -------------------------------------------------------------
>
>                 Key: TIKA-584
>                 URL: https://issues.apache.org/jira/browse/TIKA-584
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Windows XP 3, openSUSE 11.2
>            Reporter: Ajay Vohra
>
> For some PDF files (not all), when the Tika.parse(InputStream) method is 
> used, the content extracted from the returned reader has all spaces removed. 
> One example where this happens is JavaEE6Tutorial.pdf (available from 
> Oracle). There are many such files where this bug can be seen. I have also 
> tried a Tika 0.9 snapshot, and the bug remains.
> When PDFTextStripper is used directly, the extracted content is correct, with 
> the spaces between words retained.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
