[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

Tim Allison (JIRA) Thu, 04 Apr 2019 10:13:09 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809855#comment-16809855 ]


Tim Allison edited comment on TIKA-2749 at 4/4/19 5:12 PM:
-----------------------------------------------------------

There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here, along with a diagnostic for each. I offer this 
as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image-only PDF|Zero or only a few characters are extracted; inline images cover x% of the page|Might be a picture that contains no text or might be an image of text...who knows?|
|Vector graphics|With vector graphics, a PDF can draw an image with no actual underlying image file. How do we identify vector graphics?|See for example PDFBOX-2475's [^rotation.pdf]. If we render the page, '2222', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR only on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|Inline images cover x% of the page; text is extracted but it might be garbled (depending on the quality of the original scan); what are other signs of a scanned PDF?|As OCR improves, or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|Anything over 10%???|
|Incorrect unicode/character mappings|Out-of-vocabulary (OOV) stats? How else can we automatically identify this?| |
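
To make the diagnostics in the table a bit more concrete, here is a rough Python sketch of how they might be combined into a per-page check. This is only an illustration for discussion, not Tika code: {{PageStats}}, {{ocr_reasons}} and every field and parameter name are hypothetical. The image-coverage cutoff (the "x%" above), the "few characters" cutoff and the OOV cutoff are deliberately left for the caller to supply because none of them has been settled here, and the 10% default for unmapped characters is just the tentative figure from the table. Vector graphics are left out because we don't yet have a diagnostic for them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Per-page measurements. All field names are hypothetical."""
    total_chars: int       # characters returned by text extraction (including unmapped ones)
    unmapped_chars: int    # characters with no unicode mapping
    image_coverage: float  # fraction of the page area covered by inline images, 0.0 to 1.0
    oov_ratio: float       # fraction of extracted tokens that are out of vocabulary, 0.0 to 1.0


def ocr_reasons(stats: PageStats,
                few_chars: int,
                min_image_coverage: float,
                max_oov_ratio: float,
                max_unmapped_ratio: float = 0.10) -> List[str]:
    """Return the table rows (as labels) whose diagnostic fires for this page.

    An empty list means none of the diagnostics suggests running OCR.
    """
    reasons = []
    mostly_images = stats.image_coverage >= min_image_coverage

    if mostly_images and stats.total_chars <= few_chars:
        # Little or no text plus heavy image coverage.
        reasons.append("image-only PDF")
    elif mostly_images:
        # Heavy image coverage but text is present; it may be garbled.
        reasons.append("scanned PDF")

    if stats.total_chars > 0:
        # The ratios below are only meaningful when some text was extracted.
        if stats.unmapped_chars / stats.total_chars > max_unmapped_ratio:
            reasons.append("missing unicode mappings")
        if stats.oov_ratio > max_oov_ratio:
            reasons.append("incorrect unicode/character mappings")

    return reasons
```

For example, with arbitrary cutoffs chosen purely for illustration:

```python
page = PageStats(total_chars=3, unmapped_chars=0, image_coverage=0.95, oov_ratio=0.0)
print(ocr_reasons(page, few_chars=10, min_image_coverage=0.5, max_oov_ratio=0.5))
# prints ['image-only PDF']
```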


was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file|See for example PDFBOX-2475's [rotation.pdf|https://issues.apache.org/jira/secure/attachment/12933778/rotation.pdf].  If we render the page, '2222', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
