[ https://issues.apache.org/jira/browse/TIKA-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982464#action_12982464 ]
Ajay Vohra commented on TIKA-584:
---------------------------------
I tried the PDF files attached to TIKA-583, and Tika.parse(InputStream)
parses them with the spaces intact. So TIKA-583 and TIKA-584 do not appear
to be the same issue.
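
A rough way to check whether extracted text has lost its spaces is to measure
how much of it is whitespace. The sketch below is only an illustration:
SpaceCheck and whitespaceRatio are hypothetical names, not part of Tika or
PDFBox, and a StringReader stands in for the Reader that
Tika.parse(InputStream) would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
class SpaceCheck {

    // Returns the fraction of characters read from the reader that are
    // whitespace, or 0.0 if the reader yields no characters.
    static double whitespaceRatio(Reader reader) {
        long total = 0;
        long spaces = 0;
        try {
            int c;
            while ((c = reader.read()) != -1) {
                total++;
                if (Character.isWhitespace(c)) {
                    spaces++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total == 0 ? 0.0 : (double) spaces / total;
    }

    public static void main(String[] args) {
        // Assumption: in the reported scenario these readers would come from
        // Tika.parse(InputStream); plain strings are used here instead.
        Reader intact = new StringReader("Tika parses PDF files");
        Reader stripped = new StringReader("TikaparsesPDFfiles");

        System.out.println("intact:   " + whitespaceRatio(intact));
        System.out.println("stripped: " + whitespaceRatio(stripped));
    }
}
```

A ratio at or near zero for a long document would suggest the
spaces-removed behaviour described in this issue, while ordinary prose
normally has a clearly non-zero ratio.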
> Tika parse of some PDF files removes all spaces between words
> -------------------------------------------------------------
>
> Key: TIKA-584
> URL: https://issues.apache.org/jira/browse/TIKA-584
> Project: Tika
> Issue Type: Bug
> Affects Versions: 0.8
> Environment: Windows XP 3, OpenSuse 11.2
> Reporter: Ajay Vohra
>
> For some PDF files (not all), when the Tika.parse(InputStream) method is
> used, the content extracted from the returned reader has all spaces
> removed. One example is JavaEE6Tutorial.pdf (available from Oracle), and
> there are many other files where this bug can be seen. I have also tried the
> Tika 0.9 snapshot, and the bug remains.
> When PDFTextStripper is used directly, the extracted content is correct, with
> the spaces between words retained.
--
This message is automatically generated by JIRA.