[
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Villu Ruusmann updated PDFBOX-582:
----------------------------------
Attachment: pg_0005.png
Sure, there are many things to bear in mind.
In my case, the underlying image is located at coordinates [0:0] and is
stretched to cover the whole page. When rendering this document with PDFBox
0.8.0-incubating, the image is not displayed (something wrong with
CCITTFaxDecode?), but the text is, although it should not be.
Note that the OCRed text is full of mistakes (especially the table section);
if the underlying image were visible, the overlaid text would look awkward.
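To make the idea concrete, here is a rough sketch of the geometric check I have
in mind: decide whether an image covers (nearly) the whole page, and drop text
boxes that lie entirely on top of an image. It uses only java.awt.geom and is
not PDFBox API; the class OverlayFilter, its method names and the coverage
threshold are all made up for illustration, and how the rectangles would be
obtained from PageDrawer is left open.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): plain rectangle geometry only.
final class OverlayFilter {

    // True if the image overlaps at least minCoverage (0..1) of the page area.
    static boolean coversPage(Rectangle2D image, Rectangle2D page, double minCoverage) {
        Rectangle2D overlap = image.createIntersection(page);
        if (overlap.isEmpty()) {
            return false; // disjoint rectangles yield an empty intersection
        }
        double pageArea = page.getWidth() * page.getHeight();
        double overlapArea = overlap.getWidth() * overlap.getHeight();
        return pageArea > 0 && overlapArea / pageArea >= minCoverage;
    }

    // Returns the text boxes that are NOT entirely inside any image rectangle.
    static List<Rectangle2D> visibleText(List<Rectangle2D> textBoxes,
                                         List<Rectangle2D> images) {
        List<Rectangle2D> kept = new ArrayList<Rectangle2D>();
        for (Rectangle2D text : textBoxes) {
            boolean hidden = false;
            for (Rectangle2D image : images) {
                if (image.contains(text)) {
                    hidden = true; // text sits fully on top of this image
                    break;
                }
            }
            if (!hidden) {
                kept.add(text);
            }
        }
        return kept;
    }
}
```

For the attached page, a full-page image would make coversPage(...) return true
and visibleText(...) return an empty list, so only the scan would be drawn.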
> Ignoring text over images
> -------------------------
>
> Key: PDFBOX-582
> URL: https://issues.apache.org/jira/browse/PDFBOX-582
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction, Utilities
> Affects Versions: 0.8.0-incubator
> Reporter: Villu Ruusmann
> Attachments: pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in
> scanned form. However, sometimes they seem to have run OCR and added
> the recovered text as an overlay in order to give the end user a "native PDF"
> feeling, in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0,
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases, each with its own desired behavior:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the
> image part and work upon the text part.