[jira] [Comment Edited] (TIKA-3431) Using any setting other than AUTO or NO_OCR for X-Tika-PDFOcrStrategy causes remarkable performance loss

Sal (Jira) Wed, 02 Jun 2021 08:09:27 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355788#comment-17355788
 ]


Sal edited comment on TIKA-3431 at 6/2/21, 3:08 PM:
----------------------------------------------------

I think I understand what's happening now. After testing some more, I found 
that setting 'ocr_and_text_extraction' converts each PDF page to an image, runs 
OCR on it, AND extracts text from the page.  The output for each page is 
therefore duplicated: the OCR'ed portion followed by the text portion.  This is 
why those settings take so long to complete.

What I wanted was to extract the text and OCR only the inline images, not OCR 
the entire page.  That requires the following settings:

Content-Type: *application/pdf*

X-Tika-PDFextractInlineImages: *true*

X-Tika-PDFOcrStrategy: *auto* or *no_ocr*
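
For illustration, here is a minimal sketch of sending those headers to the 
local server from Python.  The endpoint (localhost:9998, path /tika) and the 
PUT method come from the issue description; the helper names `tika_headers` 
and `extract_text`, the `Accept: text/plain` header, and the check that 
rejects other strategies are assumptions made for this example, not something 
Tika itself requires.  `http.client` is used because it sends header names 
exactly as written.

```python
import http.client

# Strategies that, per the comment above, avoid rendering and OCR'ing whole pages.
FAST_STRATEGIES = {"auto", "no_ocr"}


def tika_headers(ocr_strategy="no_ocr"):
    """Build the request headers listed in the comment (hypothetical helper)."""
    if ocr_strategy not in FAST_STRATEGIES:
        # Assumption for this sketch: refuse strategies reported as slow.
        raise ValueError("expected one of %s" % sorted(FAST_STRATEGIES))
    return {
        "Content-Type": "application/pdf",
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": ocr_strategy,
        "Accept": "text/plain",  # assumption: ask for plain-text output
    }


def extract_text(pdf_path, host="localhost", port=9998, ocr_strategy="no_ocr"):
    """PUT a PDF to a running tika-server and return the response body as text."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("PUT", "/tika", body=body, headers=tika_headers(ocr_strategy))
        resp = conn.getresponse()
        data = resp.read()
        if resp.status != 200:
            raise RuntimeError("tika-server returned HTTP %d" % resp.status)
        return data.decode("utf-8")
    finally:
        conn.close()
```

With a server running locally, `extract_text("wiki.pdf")` would send the 
attached test file with the settings above.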

I was under the impression that the OCR strategy controlled how Tika 
extracted text in general (i.e. whether it applied OCR to the inline images or not).


was (Author: sallas):
I think I understand what's happening now. After testing some more, I found 
that setting 'ocr_and_text_extraction' converts each PDF page to image and runs 
OCR on it AND extracts text from the page.  The output for each page is a 
duplicate containing the OCR'ed portion followed by the text portion.  This is 
why those settings take so long to complete.  

What I wanted was to extract text and OCR only the inline images, not the 
entire page.  So that requires the following settings

Content-Type: *application/pdf*

X-Tika-PDFextractInlineImages: *true*

X-Tika-PDFOcrStrategy: *auto* or *no_ocr*

I was under the impression that ocr strategy was something to do with how Tika 
extracted text in general (i.e. applied OCR to the inline images or not) 

> Using any setting other than AUTO or NO_OCR for X-Tika-PDFOcrStrategy causes 
> remarkable performance loss
> --------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3431
>                 URL: https://issues.apache.org/jira/browse/TIKA-3431
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-server
>    Affects Versions: 1.26
>            Reporter: Sal
>            Priority: Minor
>         Attachments: wiki.pdf
>
>
> When sending a PDF document to the local Tika server using a PUT request to 
> the endpoint http://localhost:9998/tika, setting PDFOcrStrategy to anything 
> other than AUTO or NO_OCR causes an extreme slowdown in processing of the 
> PDF file.
>  
> It doesn't matter whether the PDF document has inline images or not; the 
> slowdown happens regardless.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
