Ignoring text over images
-------------------------

                 Key: PDFBOX-582
                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction, Utilities
    Affects Versions: 0.8.0-incubator
            Reporter: Villu Ruusmann


Scientific publishers often publish older articles (from 2000 and earlier) as 
scanned images. Some of them also appear to have run OCR and added the 
recognized text as an overlay, so that the document feels like a "native PDF" 
in the sense that the text can be copied and pasted.

Unlike other PDF viewers (tested with Adobe Acrobat Reader 7.0, Foxit Reader 
3.1, iText 2.1), PDFBox tries to render both the image part and the textual 
overlay part, which may produce confusing results.

There are two separate cases, each calling for different behavior:
*) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the 
image part and ignore the text part.
*) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the 
image part and work only with the text part.
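
As a rough illustration of the two cases, here is a minimal Java sketch. It is 
not PDFBox code: the class OverlayFilter, the Item type, and the rule that a 
text run is "over an image" when its bounding box lies entirely inside an 
image's bounding box are all assumptions made up for this example. How 
PageDrawer or PDFTextStripper would actually detect an overlay is left open.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not PDFBox API: a page is modeled as a flat list of
// images and text runs, each with a bounding box in page coordinates.
class OverlayFilter {

    static final class Item {
        final String text;          // null for an image
        final Rectangle2D bounds;

        Item(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }

        boolean isImage() {
            return text == null;
        }
    }

    static Item image(double x, double y, double w, double h) {
        return new Item(null, new Rectangle2D.Double(x, y, w, h));
    }

    static Item text(String s, double x, double y, double w, double h) {
        return new Item(s, new Rectangle2D.Double(x, y, w, h));
    }

    // Assumption: a text run counts as "over an image" when its box lies
    // entirely inside the box of some image on the same page.
    static boolean isOverImage(Item run, List<Item> page) {
        for (Item other : page) {
            if (other.isImage() && other.bounds.contains(run.bounds)) {
                return true;
            }
        }
        return false;
    }

    // Case 1, page rendering: keep images, drop text runs that overlay them.
    static List<Item> itemsToRender(List<Item> page) {
        List<Item> out = new ArrayList<>();
        for (Item item : page) {
            if (item.isImage() || !isOverImage(item, page)) {
                out.add(item);
            }
        }
        return out;
    }

    // Case 2, text extraction: ignore images, keep every text run.
    static List<String> textToExtract(List<Item> page) {
        List<String> out = new ArrayList<>();
        for (Item item : page) {
            if (!item.isImage()) {
                out.add(item.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Item> page = new ArrayList<>();
        page.add(image(0, 0, 600, 800));                 // the scanned page
        page.add(text("OCR overlay", 50, 100, 200, 12)); // lies inside the scan
        page.add(text("page footer", 50, 820, 200, 12)); // lies outside the scan
        System.out.println(itemsToRender(page).size() + " items rendered");
        System.out.println(textToExtract(page));
    }
}
```

With the sample page in main, rendering keeps the scanned image and the footer 
but drops the overlay text, while extraction returns both text runs and 
ignores the image.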
