Strange behavior in TextPositionComparator
------------------------------------------
Key: PDFBOX-1170
URL: https://issues.apache.org/jira/browse/PDFBOX-1170
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.6.0, 1.7.0
Environment: Windows
Reporter: Sébastien Dailly
Priority: Minor
When extracting text from the PDF (see attachment) with
setSortByPosition(true), the output follows neither the visual position of
the elements nor the document structure.
Here is the output of PDFTextStripper:
11111 333333333333333 : 222222222
The expected output would be:
11111 : 222222222 333333333333333
The string « 11111 : » is defined in a single instruction:
[(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) 1.9403500557 (1)
4.1794600487 ( ) -0.1493600011 (:) -4.7775301933 ( ) 250 ] TJ
How can it be explained that the 3... string is inserted in the middle of it?
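To make the point about the single instruction easier to check, here is a minimal sketch in plain Java (no PDFBox involved) that reads the TJ array quoted above. It relies on the PDF convention that the parenthesised operands of TJ are the strings to show and the numbers are position adjustments expressed in thousandths of a text space unit. The class name TjArraySketch and its two helper methods are hypothetical, introduced only for this illustration, and the parser assumes strings without escaped or nested parentheses, which holds for this array.

```java
import java.util.ArrayList;
import java.util.List;

public class TjArraySketch {

    /** Concatenates the string operands "(...)" of a TJ array, skipping the numbers. */
    public static String shownText(String tjArray) {
        StringBuilder text = new StringBuilder();
        boolean inString = false;
        for (char c : tjArray.toCharArray()) {
            if (!inString && c == '(') {
                inString = true;          // start of a string operand
            } else if (inString && c == ')') {
                inString = false;         // end of a string operand
            } else if (inString) {
                text.append(c);           // character that will be shown
            }
        }
        return text.toString();
    }

    /** Collects the numeric operands (adjustments in thousandths of a text space unit). */
    public static List<Double> adjustments(String tjArray) {
        List<Double> values = new ArrayList<>();
        // Drop the string operands and the array brackets so only numbers remain.
        String numbersOnly = tjArray
                .replaceAll("\\([^)]*\\)", " ")
                .replaceAll("[\\[\\]]", " ")
                .trim();
        for (String token : numbersOnly.split("\\s+")) {
            if (!token.isEmpty()) {
                values.add(Double.parseDouble(token));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // The array from the instruction quoted in this report (without the TJ operator).
        String tj = "[(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) 1.9403500557 (1) "
                  + "4.1794600487 ( ) -0.1493600011 (:) -4.7775301933 ( ) 250 ]";
        System.out.println("text: \"" + shownText(tj) + "\"");
        System.out.println("adjustments: " + adjustments(tj));
    }
}
```

Read this way, all eight glyphs of « 11111 : » come from the same array, and the only large adjustment is the final 250, which comes after the last glyph. This does not by itself show where the sorting goes wrong; it only confirms that nothing in the instruction leaves room for another string between the digits and the colon.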
(Note: the PDF has been deflated and edited to anonymise the text. I
also removed a picture, which explains the XRef error.)