Strange behavior in TextPositionComparator
------------------------------------------

                 Key: PDFBOX-1170
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1170
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0, 1.7.0
         Environment: Windows
            Reporter: Sébastien Dailly
            Priority: Minor


When extracting text from the PDF (see attachment) with 
setSortByPosition(true), the output follows neither the visual position of 
the elements nor the document structure.

Here is the output of PDFTextStripper:

11111 333333333333333 : 222222222 

The expected output would be:

11111 : 222222222 333333333333333 
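
To make the expectation concrete, here is a minimal sketch in plain Java (not 
PDFBox code) of the ordering I would expect from sorting by position: top to 
bottom, then left to right. The Fragment class and all coordinates are 
hypothetical and chosen only for illustration; they are not taken from the 
attached PDF.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {
    /** Hypothetical text fragment: a string plus the coordinates where it starts. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts fragments top to bottom, then left to right, and joins them with spaces. */
    static String sortAndJoin(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed layout: three fragments on the same line, given out of order.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("333333333333333", 300, 100));
        fragments.add(new Fragment("11111 :", 50, 100));
        fragments.add(new Fragment("222222222", 120, 100));
        System.out.println(sortAndJoin(fragments));
    }
}
```

With these assumed coordinates the fragments come out in left-to-right order, 
which is the behaviour I expected from the stripper.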

The string « 11111 : » is defined in a single instruction:

 [(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) 1.9403500557 (1) 
4.1794600487 ( ) -0.1493600011 (:) -4.7775301933 ( ) 250 ] TJ
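
For reference, the numbers in a TJ array are expressed in thousandths of a 
unit of text space and are subtracted from the current horizontal position, 
so the glyphs of one TJ are laid out one after another. The sketch below 
(plain Java, not PDFBox code) walks the array above and computes where each 
glyph starts. The glyph widths and the font size are assumptions made for 
illustration; the real values come from the font and the text state of the 
PDF, and character spacing, word spacing and horizontal scaling are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TjOffsetsSketch {
    /** Assumed glyph widths in thousandths of an em; real widths come from the font. */
    static double glyphWidth(char c) {
        return Character.isDigit(c) ? 556 : 278;
    }

    /** The TJ array quoted in this report: strings are shown, numbers are adjustments. */
    static List<Object> reportedArray() {
        return Arrays.<Object>asList(
                "1", -9.555729866, "1", 17.5939998627, "1", 3.5597500801,
                "1", 1.9403500557, "1", 4.1794600487, " ", -0.1493600011,
                ":", -4.7775301933, " ", 250.0);
    }

    /** Returns the horizontal start offset of every glyph, relative to the first one. */
    static List<Double> glyphOffsets(List<Object> tjArray, double fontSize) {
        List<Double> offsets = new ArrayList<>();
        double x = 0;
        for (Object item : tjArray) {
            if (item instanceof String) {
                for (char c : ((String) item).toCharArray()) {
                    offsets.add(x);
                    x += glyphWidth(c) / 1000.0 * fontSize;
                }
            } else {
                // A numeric element moves the position back by value / 1000 * font size.
                x -= ((Number) item).doubleValue() / 1000.0 * fontSize;
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        double assumedFontSize = 12.0; // hypothetical; not taken from the PDF
        for (double offset : glyphOffsets(reportedArray(), assumedFontSize)) {
            System.out.printf("%.3f%n", offset);
        }
    }
}
```

Under these assumptions the eight glyphs get strictly increasing offsets, 
because every adjustment between them is far smaller than a glyph width.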

How can it be explained that the 3... is inserted in the middle of it?

(Note: the PDF has been deflated and edited to anonymise the text. I 
also removed a picture, which explains the XRef error.)


