[ 
https://issues.apache.org/jira/browse/SOLR-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated SOLR-7137:
------------------------------------
    Priority: Major  (was: Blocker)

> Upgrade to Tika 1.7 in 4_10_3 branch
> ------------------------------------
>
>                 Key: SOLR-7137
>                 URL: https://issues.apache.org/jira/browse/SOLR-7137
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 4.10.3
>            Reporter: Chris A. Mattmann
>             Fix For: 4.10.4
>
>         Attachments: SOLR-7137.Mattmann.022115.patch.txt
>
>
> I have been trying out SolrCell as an alternative way to ingest around 40M 
> images using Tesseract/OCR and Tika. I noticed that in 4.10.3 Tika is pinned 
> to 1.5. With Tika 1.5 in SolrCell 4.10.3, only about 5,600 images out of a 
> subset of 50,000 are ingested when I run a series of 50K cURL commands 
> against the extract handler. I had a feeling it had something to do with the 
> fact that some of the extracted characters are oddball characters 
> (4@#@#/ ^^^^) due to Tesseract not always extracting the right text. But then 
> I remembered that Tesseract didn't land in Tika until 1.7.
> So regardless, I thought I'd upgrade the 4.10.x branch to Tika 1.7. The 
> attached patch to do so is trivial (Tika + compress updates). Now all 50K 
> images in the 50K subset are ingested, but I'm noticing something else weird. 
> Despite the fact that Tesseract is called, and despite the fact that on 
> certain images I can verify text is extracted by running Tesseract from the 
> command line on that file, all I am getting in the "content" field of 
> SolrCell is a bunch of "\n \n \n \n \n \n" text. So the text is extracted, 
> weird characters included, but it doesn't make it into Solr. Extremely odd.
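
For anyone trying to reproduce this, here is a minimal Python sketch of one of
those extract-handler requests plus a check for the whitespace-only "content"
symptom. It is not the reporter's actual script. The base URL, the core name
"collection1", the document id, and the helper names are assumptions made for
illustration; literal.id, commit, and extractOnly are standard parameters of
the extracting request handler.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, core, doc_id, extract_only=False):
    """Build a URL for Solr's /update/extract handler (hypothetical helper)."""
    params = {"literal.id": doc_id, "wt": "json"}
    if extract_only:
        # Return the extracted text in the response without indexing it.
        params["extractOnly"] = "true"
    else:
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


def is_blank_content(text):
    """True when extracted content is only whitespace, e.g. '\\n \\n \\n'."""
    return text.strip() == ""


def post_image(url, path, content_type="image/png"):
    """POST one image file to the handler; returns (HTTP status, body)."""
    with open(path, "rb") as fh:
        req = Request(url, data=fh.read(),
                      headers={"Content-Type": content_type})
    with urlopen(req) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Calling post_image with a URL built using extract_only=True should show what
Tika hands back before anything is indexed, which may help tell an extraction
problem apart from an indexing problem.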



