[jira] Updated: (PDFBOX-582) Ignoring text over images

Villu Ruusmann (JIRA) Wed, 09 Dec 2009 07:29:41 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Villu Ruusmann updated PDFBOX-582:
----------------------------------

    Attachment: pg_0005.png

Sure, there are many things to bear in mind.

In my case, the underlying image is located at coordinates [0:0] and is 
stretched over the entire page. When rendering this document with PDFBox 
0.8.0-incubating, the image is not displayed (something wrong with 
CCITTFaxDecode?), but the text is, although it shouldn't be.

Note that the OCRed text is full of mistakes (especially the table section) - 
if the underlying image were visible, the text drawn on top of it would look 
awkward.
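
To make the idea concrete, here is a rough sketch of how the rendering case 
could detect such pages: if any image covers (almost) the whole page, treat 
the page as a scan and skip drawing the text. This is plain Java and does not 
use any PDFBox API; the class name, the method names and the 95% threshold 
are all assumptions of mine, and the caller would have to supply the page and 
image rectangles in the same coordinate space.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class OverlayHeuristic {

    // Assumed threshold: an image covering at least 95% of the page area
    // is treated as a full-page scan.
    static final double MIN_COVERAGE = 0.95;

    // Returns true if the image covers (nearly) the whole page.
    static boolean isFullPageScan(Rectangle2D page, Rectangle2D image) {
        double pageArea = page.getWidth() * page.getHeight();
        if (pageArea <= 0) {
            return false;
        }
        Rectangle2D overlap = page.createIntersection(image);
        if (overlap.isEmpty()) {
            return false; // the image lies completely outside the page
        }
        double covered = overlap.getWidth() * overlap.getHeight();
        return covered / pageArea >= MIN_COVERAGE;
    }

    // Rendering case: draw text only if no image is a full-page scan.
    static boolean shouldDrawText(Rectangle2D page, List<Rectangle2D> images) {
        for (Rectangle2D image : images) {
            if (isFullPageScan(page, image)) {
                return false;
            }
        }
        return true;
    }
}
```

For the attached page, the image starts at [0:0] and has the page's size, so 
shouldDrawText would return false. A page with only a small figure would 
still have its text drawn. Text extraction would not need this check at all, 
since it ignores images anyway.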

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in 
> scanned form. However, sometimes they seem to have conducted OCR and added 
> the recovered text as an overlay in order to give the end user a "native PDF" 
> feeling, in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, 
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part 
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the 
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the 
> image part and work on the text part.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
