[
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809855#comment-16809855
]
Tim Allison edited comment on TIKA-2749 at 4/4/19 4:49 PM:
-----------------------------------------------------------
There are several reasons why one might want to run OCR on a PDF page. It
might be useful to catalog those here along with a diagnostic. I offer this as
a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image-only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a picture containing no text or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file| |
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing Unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect Unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
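
To make the table concrete, here is a rough sketch of how the diagnostics could be combined into a per-page check. It is only an illustration for discussion: PageStats, ocr_reasons and every threshold except the tentative 10% for missing Unicode mappings are hypothetical names and placeholder values, not part of Tika or PDFBox. The vector graphics row is left out because the table does not yet give a diagnostic for it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page statistics; these field names are assumptions."""
    extracted_chars: int        # characters extracted from the page's text layer
    image_coverage: float       # fraction of the page covered by inline images (0.0-1.0)
    unmapped_char_ratio: float  # fraction of characters with no Unicode mapping
    oov_ratio: float            # fraction of tokens that are out of vocabulary


# Placeholder thresholds. Only the 10% figure appears in the table,
# and even there it is marked as an open question.
MAX_CHARS_FOR_IMAGE_ONLY = 10
MIN_IMAGE_COVERAGE = 0.5
MAX_UNMAPPED_RATIO = 0.10
MAX_OOV_RATIO = 0.5


def ocr_reasons(stats: PageStats) -> List[str]:
    """Return the reasons (possibly none) why OCR might be worth running on a page."""
    reasons = []
    large_image = stats.image_coverage >= MIN_IMAGE_COVERAGE
    few_chars = stats.extracted_chars <= MAX_CHARS_FOR_IMAGE_ONLY

    if large_image and few_chars:
        reasons.append("image-only page")
    elif large_image:
        # Text exists, but it may be a garbled layer from an earlier OCR pass.
        reasons.append("scanned page with text layer")

    if stats.unmapped_char_ratio > MAX_UNMAPPED_RATIO:
        reasons.append("missing unicode mappings")
    if stats.oov_ratio > MAX_OOV_RATIO:
        reasons.append("incorrect unicode mappings")
    return reasons


if __name__ == "__main__":
    page = PageStats(extracted_chars=3, image_coverage=0.9,
                     unmapped_char_ratio=0.0, oov_ratio=0.0)
    print(ocr_reasons(page))
```

Returning a list of reasons rather than a single yes/no keeps the individual diagnostics visible, which should make it easier to tune each threshold separately.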
> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
> Key: TIKA-2749
> URL: https://issues.apache.org/jira/browse/TIKA-2749
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.