[
https://issues.apache.org/jira/browse/SOLR-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044471#comment-18044471
]
Álvaro Lechner commented on SOLR-18022:
---------------------------------------
When I send a scanned PDF along with its metadata to Solr 9.6.1, the internal
Tika does not invoke OCR, so only the metadata is indexed and the document
content is not.
However, in 9.10, Tika attempts to perform OCR. If this attempt fails, Tika is
restarted, and Solr fails to index anything at all (neither content nor
metadata).
This might be the expected behavior, but I believe it would be beneficial to
have a configuration option to ignore OCR failure and index the document based
only on its metadata.
I am unsure whether this configuration should be handled by Solr or Tika.
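Until such an option exists, one possible client-side workaround is to retry
with the metadata alone when extraction fails. The following is only a sketch
under assumptions, not an existing Solr or Tika feature: the core URL `SOLR`,
the timeouts, and all function names are hypothetical, the extracting handler
is assumed to be mapped at `/update/extract`, and commits are assumed to be
handled by autoCommit.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Mapping

SOLR = "http://localhost:8983/solr/docs"  # hypothetical core URL


def extract_url(metadata: Mapping[str, str]) -> str:
    """Build the extract URL, passing each metadata field as literal.<field>."""
    params = {f"literal.{name}": value for name, value in metadata.items()}
    return f"{SOLR}/update/extract?" + urllib.parse.urlencode(params)


def send_extract(pdf: bytes, metadata: Mapping[str, str], timeout: float = 60.0) -> None:
    """POST the PDF to the extracting handler (content plus metadata)."""
    req = urllib.request.Request(
        extract_url(metadata), data=pdf, headers={"Content-Type": "application/pdf"}
    )
    with urllib.request.urlopen(req, timeout=timeout):
        pass


def send_metadata_only(metadata: Mapping[str, str], timeout: float = 10.0) -> None:
    """POST the metadata as a plain JSON document, bypassing Tika entirely."""
    body = json.dumps([dict(metadata)]).encode("utf-8")
    req = urllib.request.Request(
        f"{SOLR}/update", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout):
        pass


def index_with_fallback(extract: Callable[[], None],
                        metadata_only: Callable[[], None]) -> str:
    """Try full extraction; on a network or HTTP error, index metadata alone."""
    try:
        extract()
        return "content+metadata"
    except OSError:  # covers urllib's URLError/HTTPError and socket timeouts
        metadata_only()
        return "metadata-only"
```

A caller would wrap the two requests, for example
`index_with_fallback(lambda: send_extract(pdf, meta), lambda: send_metadata_only(meta))`.
The drawback is that every client has to implement this itself, which is why
a server-side option would still be preferable.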
In my application, I have numerous indexed documents. Some of these documents
lack content because they require OCR. I can still successfully locate them
using other metadata I send to Solr. Among these, some are large PDF files
(over 20 MB) for which the OCR process can take more than 30 seconds to run.
> Solr don't index sent metadata when external Tika fails
> -------------------------------------------------------
>
> Key: SOLR-18022
> URL: https://issues.apache.org/jira/browse/SOLR-18022
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 9.10
> Reporter: Álvaro Lechner
> Priority: Major
>
> When I send a big PDF to Solr and Tika OCR times out, Solr doesn't index
> the metadata sent.
> This occurs when the Solr HTTP client times out, or when Tika times out and
> drops the connection.