[ 
https://issues.apache.org/jira/browse/PDFBOX-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631195#comment-14631195
 ] 

Tilman Hausherr commented on PDFBOX-2890:
-----------------------------------------

There is no such thing as "wrapping" in PDFs. The content stream only places 
runs of glyphs at explicit positions, and each visual line gets its own text 
matrix (Tm). Here are two lines of the first "paragraph":
{code}
BT
/F1 11.04 Tf
1 0 0 1 72.024 716.74 Tm
0 g
0 G
[ (This is ), 10, (m), -4, (y), -3, ( ), 9, (te), -3, (s), 11, (t ), -3, (d), 
3, (o), 5, (cum), 8, (ent, ), 8, (which), 5, ( has a s), 8, (entenc), 11, (e ), 
-3, (tha), 13, (t ), -3, (i), 13, (s lo), -5, (n), 3, (g), 4, ( e), -3, (n), 
14, (o), -5, (u), 3, (g), 4, (h), 3, ( t), 7, (o), -5, ( w), -4, (rap), 16, 
( o), 3, (v), -4, (er), 10, ( t), -3, (w), 8, (o), -5, ( lin), 4, (es), 9, ( bu), 
4, (t ), -3, (w), 8, (e )] TJ
ET
BT
1 0 0 1 72.024 701.14 Tm
[ (want it ), 6, (to), 3, ( app), 5, (ear as), 9, ( a ), -3, (sin), 5, (g), 4, 
(l), 13, (e ), -3, (li), 3, (n), 3, (e ), 7, (when w), 8, (e ), -3, (e), 9, 
(xt), -3, (rac), 12, (t )] TJ
ET
{code}
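
Since the two lines above differ only in their Tm baseline (716.74 vs. 701.14, a gap of 15.6 units, or roughly 1.41 times the 11.04 font size), any "unwrapping" has to be a heuristic applied after extraction. The following is a minimal sketch of one such heuristic in plain Java. It does not use or describe PDFBox's API: the names LineJoinSketch, Line, textOfTj and joinLines, and the 1.5 leading factor, are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: joins extracted lines whose
// baselines are about one line of leading apart.
class LineJoinSketch {

    /** One extracted line: baseline y (from its Tm operator) and its text. */
    static final class Line {
        final double baselineY;
        final String text;

        Line(double baselineY, String text) {
            this.baselineY = baselineY;
            this.text = text;
        }
    }

    /**
     * Concatenates the string operands of a TJ array and skips the numeric
     * adjustments. Assumption: the adjustments are small kerning values, as in
     * the dump above, and never stand in for a space.
     */
    static String textOfTj(Object... operands) {
        StringBuilder sb = new StringBuilder();
        for (Object operand : operands) {
            if (operand instanceof String) {
                sb.append((String) operand);
            }
        }
        return sb.toString();
    }

    /**
     * Merges consecutive lines into one paragraph when the next baseline is
     * below the previous one by at most maxLeadingFactor * fontSize.
     * Anything else starts a new paragraph.
     */
    static List<String> joinLines(List<Line> lines, double fontSize, double maxLeadingFactor) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null) {
                double gap = prev.baselineY - line.baselineY;
                boolean sameParagraph = gap > 0 && gap <= maxLeadingFactor * fontSize;
                if (!sameParagraph) {
                    paragraphs.add(current.toString().trim());
                    current.setLength(0);
                } else if (current.length() > 0
                        && !Character.isWhitespace(current.charAt(current.length() - 1))) {
                    current.append(' '); // keep words apart if the line had no trailing space
                }
            }
            current.append(line.text);
            prev = line;
        }
        if (prev != null) {
            paragraphs.add(current.toString().trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
                new Line(716.74, textOfTj(" bu", 4, "t ", -3, "w", 8, "e ")),
                new Line(701.14, textOfTj("want it ", 6, "to", 3, " app", 5, "ear as")));
        System.out.println(joinLines(lines, 11.04, 1.5));
    }
}
```

A heuristic like this cannot tell a wrapped line from a real line break that happens to use the same leading, so it can only approximate what the reporter is asking for.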

> Wrapped lines in PDF files not processed correctly
> --------------------------------------------------
>
>                 Key: PDFBOX-2890
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2890
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: James Baker
>              Labels: wrapping
>
> Text that wraps over multiple lines in PDF documents is not extracted 
> correctly by PDFBox. The expected behaviour would be for it to be extracted 
> as a single line, but instead a line break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended 
> form, as it is not known whether a line break in the extracted text is one 
> that appeared in the document or one that was inserted by PDFBox.



