[
https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631200#comment-14631200
]
Tim Allison commented on TIKA-1671:
-----------------------------------
I think this is an issue with PDFs in general, not with PDFBox. I _think_ that
software that generates a PDF can choose to include "accessible" text, which
is reliable Unicode text stored in reading order. However, if that doesn't exist,
then we have to pull out text based on presentation instructions, which will
break words over lines, etc. Oh, and my favorite: sometimes spaces aren't
stored in the PDFs, and you have to guess where thewordbreaksare based on
computations on character widths.
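
To illustrate the last point, here is a minimal sketch of guessing word breaks
from glyph positions. It is not PDFBox's or Tika's actual algorithm; the Glyph
type, the guess_word_breaks function, and the gap_ratio threshold are all
hypothetical names and values chosen for illustration. The idea is to insert a
space wherever the horizontal gap between two adjacent glyphs is large
relative to the average glyph width.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One positioned character on a single line (hypothetical type)."""
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def guess_word_breaks(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a line of text, inserting a space wherever the gap between
    adjacent glyphs exceeds gap_ratio times the average glyph width.
    The 0.3 default is an assumed threshold, not a value from any library."""
    if not glyphs:
        return ""
    avg_width = sum(g.width for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        # Distance from the right edge of the previous glyph to the
        # left edge of the current one.
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# No space character is stored; only the positions hint at the word break.
sample = [
    Glyph("t", 0, 5), Glyph("h", 5, 5), Glyph("e", 10, 5),
    Glyph("w", 20, 5), Glyph("o", 25, 5), Glyph("r", 30, 5), Glyph("d", 35, 5),
]
print(guess_word_breaks(sample))  # the word
```

A fixed ratio like this is fragile: tightly kerned or letter-spaced text can
push real gaps below the threshold or inflate non-gaps above it, which is why
the extracted output can only ever be a guess.
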
> Wrapped lines in PDF files not processed correctly
> --------------------------------------------------
>
> Key: TIKA-1671
> URL: https://issues.apache.org/jira/browse/TIKA-1671
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: James Baker
> Labels: pdf, wrapping
> Attachments: Test Document.pdf
>
>
> Text that wraps over multiple lines in PDF documents is not extracted
> correctly by Tika. The expected behaviour would be for it to be extracted as
> a single line, but instead a line break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended
> form, as it is not known whether a line break in the extracted text is one
> that appeared in the document or one that was inserted by Tika.