Nirmal Tandel created PDFBOX-6188:
-------------------------------------

             Summary: PDFTextStripper misses text occurrences in PDFs with 
out-of-order character drawing when setSortByPosition(false)
                 Key: PDFBOX-6188
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6188
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.7 PDFBox, 2.0.29
            Reporter: Nirmal Tandel
         Attachments: A151_src.pdf, A403_ref.pdf

When using {{PDFTextStripper}} to search for text in a vector PDF, not all 
occurrences of the search string are found. The root cause is that the PDF 
content stream draws characters in non-left-to-right visual order. With 
{{setSortByPosition(false)}} (the default), PDFBox respects drawing order and 
produces garbled token groupings, causing text searches to miss valid matches. 
With {{setSortByPosition(true)}}, PDFBox fixes those cases but breaks 
extraction of PDFs containing rotated (e.g. 45-degree) text, where it groups 
diagonal glyphs with horizontal ones incorrectly.
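
The ordering problem can be illustrated in isolation with a minimal sketch in 
plain Java that does not use PDFBox at all. The {{Glyph}} and 
{{DrawOrderDemo}} types and the coordinates are hypothetical, made up for this 
illustration, and the sketch assumes a single horizontal line of text. It only 
shows why concatenating glyphs in content-stream order can hide a string that 
is visually present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of one drawn glyph: its character and x position.
// This is NOT a PDFBox type; it exists only for this illustration.
final class Glyph {
    final char ch;
    final double x;

    Glyph(char ch, double x) {
        this.ch = ch;
        this.x = x;
    }
}

final class DrawOrderDemo {
    // Concatenate glyphs in list order, i.e. the order they were drawn.
    static String inDrawOrder(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    // Concatenate glyphs after sorting by x (valid for one horizontal line only).
    static String inVisualOrder(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        return inDrawOrder(sorted);
    }

    // Count (possibly overlapping) occurrences of needle in text.
    static int countOccurrences(String text, String needle) {
        int count = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Reads "A403" on the page, but is drawn in the order '4', '0', 'A', '3'.
        List<Glyph> label = Arrays.asList(
                new Glyph('4', 10), new Glyph('0', 20),
                new Glyph('A', 0), new Glyph('3', 30));

        String drawn = inDrawOrder(label);
        String visual = inVisualOrder(label);
        System.out.println(drawn + " -> " + countOccurrences(drawn, "A403") + " match(es)");
        System.out.println(visual + " -> " + countOccurrences(visual, "A403") + " match(es)");
    }
}
```

Sorting by x alone is safe here only because all four glyphs share one 
baseline. The sketch says nothing about how rotated text should be grouped.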
h3. Steps to Reproduce
 # Open the affected PDF page ({{A151}}) using {{PDDocument.load(...)}} (2.x) 
or {{Loader.loadPDF(...)}} (3.x).
 # Use {{PDFTextStripper}} to extract text or locate all occurrences of the 
string {{A403}} via {{PDFTextStripperByArea}} or a custom subclass.
 # With {{setSortByPosition(false)}} (default): only *2 of the 4* actual 
occurrences of {{A403}} on the page are found.
 # With {{setSortByPosition(true)}}: more occurrences are found on this 
page, but other PDFs whose content streams contain 45-degree / diagonal text 
are broken: PDFBox merges diagonal glyphs with horizontal glyphs, producing 
incorrect word groupings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

