[
https://issues.apache.org/jira/browse/PDFBOX-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631195#comment-14631195
]
Tilman Hausherr commented on PDFBOX-2890:
-----------------------------------------
There is no such thing as "wrapping" in PDFs. There are just lines with glyphs
at a certain position. Here are two lines of the first "paragraph":
{code}
BT
/F1 11.04 Tf
1 0 0 1 72.024 716.74 Tm
0 g
0 G
[ (This is ), 10, (m), -4, (y), -3, ( ), 9, (te), -3, (s), 11, (t ), -3, (d),
3, (o), 5, (cum), 8, (ent, ), 8, (which), 5, ( has a s), 8, (entenc), 11, (e ),
-3, (tha), 13, (t ), -3, (i), 13, (s lo), -5, (n), 3, (g), 4, ( e), -3, (n),
14, (o), -5, (u), 3, (g), 4, (h), 3, ( t), 7, (o), -5, ( w), -4, (rap), 16, (
o), 3, (v), -4, (er), 10, ( t), -3, (w), 8, (o), -5, ( lin), 4, (es), 9, ( bu),
4, (t ), -3, (w), 8, (e )] TJ
ET
BT
1 0 0 1 72.024 701.14 Tm
[ (want it ), 6, (to), 3, ( app), 5, (ear as), 9, ( a ), -3, (sin), 5, (g), 4,
(l), 13, (e ), -3, (li), 3, (n), 3, (e ), 7, (when w), 8, (e ), -3, (e), 9,
(xt), -3, (rac), 12, (t )] TJ
ET
{code}
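To illustrate: the two lines above differ only in the y value of their Tm operators (716.74 vs. 701.14, a gap of 15.6 units at a font size of 11.04). Any "unwrapping" therefore has to be a guess made after extraction. The following is only a minimal sketch of one such guess and is not part of PDFBox; the class WrapJoinSketch, its Line type, and the maxLeading threshold are hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WrapJoinSketch {

    /** One positioned line: the origin taken from its Tm operator plus the decoded text. */
    static final class Line {
        final double x;
        final double y;
        final String text;

        Line(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Hypothetical heuristic (an assumption, not PDFBox behaviour): a line continues
     * the previous one when it starts at the same x, lies at most
     * maxLeading * fontSize below the previous baseline, and the previous line
     * does not end with sentence-final punctuation.
     */
    static List<String> joinWrapped(List<Line> lines, double fontSize, double maxLeading) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        Line prev = null;
        for (Line line : lines) {
            boolean continuation = prev != null
                    && Math.abs(line.x - prev.x) < 0.5
                    && prev.y - line.y > 0
                    && prev.y - line.y <= maxLeading * fontSize
                    && !endsSentence(prev.text);
            if (continuation) {
                // Keep exactly one space between the joined pieces.
                if (current.length() > 0 && current.charAt(current.length() - 1) != ' ') {
                    current.append(' ');
                }
                current.append(line.text);
            } else {
                if (current != null) {
                    out.add(current.toString().trim());
                }
                current = new StringBuilder(line.text);
            }
            prev = line;
        }
        if (current != null) {
            out.add(current.toString().trim());
        }
        return out;
    }

    /** True when the trimmed text is empty or ends with '.', '!', '?' or ':'. */
    static boolean endsSentence(String s) {
        String t = s.trim();
        if (t.isEmpty()) {
            return true;
        }
        char c = t.charAt(t.length() - 1);
        return c == '.' || c == '!' || c == '?' || c == ':';
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        // Text decoded from the two TJ arrays above (kerning numbers dropped).
        lines.add(new Line(72.024, 716.74,
                "This is my test document, which has a sentence that is long enough "
                        + "to wrap over two lines but we "));
        lines.add(new Line(72.024, 701.14,
                "want it to appear as a single line when we extract "));
        for (String s : joinWrapped(lines, 11.04, 1.5)) {
            System.out.println(s);
        }
    }
}
```

Such a heuristic will misjudge some documents (for example, a heading directly followed by body text at the same indent), which is why the PDF itself cannot tell you whether a line break was "intended".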
> Wrapped lines in PDF files not processed correctly
> --------------------------------------------------
>
> Key: PDFBOX-2890
> URL: https://issues.apache.org/jira/browse/PDFBOX-2890
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: James Baker
> Labels: wrapping
>
> Text that wraps over multiple lines in PDF documents is not extracted
> correctly by PDFBox. The expected behaviour would be for it to be extracted
> as a single line, but instead a line break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended
> form, as it is not known whether a line break in the extracted text is one
> that appeared in the document or one that was inserted by PDFBox.