[
https://issues.apache.org/jira/browse/SOLR-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uwe Schindler reopened SOLR-7137:
---------------------------------
Sorry, resolution was wrong
> Upgrade to Tika 1.7 in 4_10_3 branch
> ------------------------------------
>
> Key: SOLR-7137
> URL: https://issues.apache.org/jira/browse/SOLR-7137
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 4.10.3
> Reporter: Chris A. Mattmann
> Assignee: Uwe Schindler
> Attachments: SOLR-7137.Mattmann.022115.patch.txt
>
>
> I have been trying out SolrCell as an alternative to ingesting around 40M
> images using Tesseract/OCR and Tika. I noticed that in 4.10.3 Tika is pinned to
> 1.5. With Tika 1.5 in SolrCell 4.10.3, only about 5,600 images of a 50,000-image
> subset are ingested when I run a series of 50,000 cURL commands against the
> extract handler. I had a feeling it has something to do with the fact that
> some of the characters extracted are oddball characters (4@#@#/ ^^^^) due to
> Tesseract not always extracting the right text. But then I remembered
> Tesseract didn't land in Tika until 1.7.
> So regardless, I thought I'd upgrade the 4.10.x branch to Tika 1.7. This is a
> trivial patch to do so, attached (Tika + compress updates). Now all 50,000
> images in the subset are ingested, but I'm noticing something else weird.
> Despite the fact that Tesseract is called, and despite the fact that on
> certain images I can verify text is extracted by running Tesseract from the
> command line on that file, all I am getting in the "content" field of
> SolrCell is a bunch of "\n \n \n \n \n \n" text. So the text is extracted,
> there are weird characters, but they don't make it into Solr. Extremely odd.
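
For anyone trying to reproduce the symptom in the description, here is a minimal sketch of two helpers: one builds the request URL for the extract handler, the other flags a "content" value that holds nothing but whitespace. The function names, the base URL, the core name "collection1", and the document id are assumptions for illustration and do not come from the issue or the attached patch. The sketch sends no request.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, commit=False):
    """Build a URL for Solr's extract handler (/update/extract).

    solr_base and doc_id are hypothetical values supplied by the caller;
    literal.id sets the id field of the document being indexed.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def is_effectively_empty(content):
    """Return True when the extracted content holds only whitespace."""
    return content.strip() == ""


if __name__ == "__main__":
    # Assumed local Solr core; adjust to the actual deployment.
    print(build_extract_url("http://localhost:8983/solr/collection1", "img-0001", commit=True))
    # The symptom from the report: a content field made of newlines and spaces.
    print(is_effectively_empty("\n \n \n \n \n \n"))
```

Running the whitespace check over the stored "content" values after an ingest run would show how many documents were indexed without any usable OCR text.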