[ 
https://issues.apache.org/jira/browse/SOLR-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044471#comment-18044471
 ] 

Álvaro Lechner commented on SOLR-18022:
---------------------------------------

When I send a scanned PDF along with its metadata to Solr 9.6.1, the internal 
Tika does not invoke OCR, so only the metadata is indexed, with no document 
content.

However, in 9.10, Tika attempts to perform OCR. If this attempt fails, Tika is 
restarted, and Solr fails to index anything at all (neither content nor 
metadata).

This might be the expected behavior, but I believe it would be beneficial to 
have a configuration option to ignore OCR failure and index the document based 
only on its metadata.

I am unsure whether this configuration should be handled by Solr or Tika.

In my application, I have numerous indexed documents. Some of these documents 
lack content because they require OCR. I can still successfully locate them 
using other metadata I send to Solr. Among these, some are large PDF files 
(over 20MB) where the OCR process can take over 30 seconds to run.
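
Until such an option exists, the fallback described above could be 
approximated on the client side. The following is only a minimal sketch, not 
a Solr or Tika feature: the base URL, the collection name, and the helper 
names (extract_url, post, index_with_fallback) are assumptions made for 
illustration. It tries /update/extract first and, if that request fails or 
times out, sends the metadata alone as a JSON document to /update.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL and collection name; adjust for a real deployment.
SOLR = "http://localhost:8983/solr/mycollection"


def extract_url(metadata):
    """Build an /update/extract URL that passes metadata as literal.<field> params."""
    params = {f"literal.{key}": value for key, value in metadata.items()}
    params["commit"] = "true"
    return f"{SOLR}/update/extract?{urllib.parse.urlencode(params)}"


def post(url, body, content_type, timeout):
    """POST a raw body; urlopen raises an OSError subclass on HTTP errors and timeouts."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def index_with_fallback(pdf_bytes, metadata, timeout=60, send=post):
    """Try full extraction first; if it fails, index the metadata on its own."""
    try:
        send(extract_url(metadata), pdf_bytes, "application/pdf", timeout)
        return "content+metadata"
    except OSError:
        # Extraction failed or timed out (HTTP error, dropped connection, timeout):
        # index only the metadata as a plain JSON document.
        doc = json.dumps([metadata]).encode("utf-8")
        send(f"{SOLR}/update?commit=true", doc, "application/json", timeout)
        return "metadata-only"
```

A call such as index_with_fallback(pdf_bytes, {"id": "doc1", "title": "Scan"}) 
returns "metadata-only" when extraction fails, so the caller can record which 
documents still lack content and retry OCR for them later.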

> Solr don't index sent metadata when external Tika fails
> -------------------------------------------------------
>
>                 Key: SOLR-18022
>                 URL: https://issues.apache.org/jira/browse/SOLR-18022
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 9.10
>            Reporter: Álvaro Lechner
>            Priority: Major
>
> When I send a large PDF to Solr and Tika OCR times out, Solr does not index 
> the metadata sent with it.
> This occurs when the Solr HTTP client times out, or when Tika times out and 
> drops the connection.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

