[ https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610474#comment-14610474 ]
Tim Allison commented on TIKA-1671:
-----------------------------------
Thank you for raising this. I think TIKA-1641 covers the same type of issue.
If you can run the same file through pure PDFBox-app's ExtractText and see
whether you get the same result, that'd be great. If you do, then
unfortunately it is beyond the scope of Tika to recombine lines. If PDFBox
alone gives you what you want, then there may be something in Tika that we
can fix.
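
One possible way to make that comparison concrete, as a rough sketch only: save
the text from each tool to a file and check whether the line breaks fall in the
same places. The class and method names below (LineBreakCompare,
sameLineBreaks, sameWords) are hypothetical helpers, not part of Tika or
PDFBox, and the jar names in the comments are placeholders for whatever
versions you have locally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing two plain-text extractions of one PDF.
// Assumed way to produce the inputs (adjust jar names to your versions):
//   java -jar pdfbox-app-x.y.z.jar ExtractText "Test Document.pdf" pdfbox.txt
//   java -jar tika-app-x.y.jar --text "Test Document.pdf" > tika.txt
public class LineBreakCompare {

    /** Splits on any line terminator, collapses inner whitespace, drops blank lines. */
    public static List<String> normalizedLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim().replaceAll("\\s+", " ");
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** True if both texts break into the same sequence of non-blank lines. */
    public static boolean sameLineBreaks(String a, String b) {
        return normalizedLines(a).equals(normalizedLines(b));
    }

    /** True if both texts contain the same words once line breaks are ignored. */
    public static boolean sameWords(String a, String b) {
        return String.join(" ", normalizedLines(a))
                .equals(String.join(" ", normalizedLines(b)));
    }

    public static void main(String[] args) throws IOException {
        String first = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String second = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        System.out.println("same line breaks: " + sameLineBreaks(first, second));
        System.out.println("same words:       " + sameWords(first, second));
    }
}
```

If both tools report the same line breaks, the wrapping most likely comes from
PDFBox's text extraction rather than from anything Tika adds on top. If only
the words match, the difference is worth attaching to the issue.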
> Wrapped lines in PDF files not processed correctly
> --------------------------------------------------
>
> Key: TIKA-1671
> URL: https://issues.apache.org/jira/browse/TIKA-1671
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: James Baker
> Labels: pdf, wrapping
> Attachments: Test Document.pdf
>
>
> Text that wraps over multiple lines in PDF documents is not extracted
> correctly by Tika. The expected behaviour would be for it to be extracted as
> a single line, but instead a line break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended
> form, as it is not known whether a line break in the extracted text is one
> that appeared in the document or one that was inserted by Tika.