[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Tommaso Teofili (JIRA) Tue, 27 Jul 2010 15:09:44 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892957#action_12892957
 ]


Tommaso Teofili commented on SOLR-1902:
---------------------------------------

Hi all, I had the same issue David has, so I applied the patch (modifying files 
one by one) to a fresh Solr 1.4.1 checkout. With the "example" Solr instance, 
most of my PDFs are now indexed with their text extracted. 
Within the apache-solr-1.4.1 release I replaced all the files inside 
apache-solr-1.4.1/dist with the ones generated (inside the dist directory) by 
running 'ant dist' on the patched 1.4.1 source code. I also replaced the 
release war inside example/webapps with the generated (patched) war; this last 
step was mandatory to avoid the NoSuchMethodError reported above. Then I ran 
'java -jar start.jar' from the example dir and everything worked.
Note that I used the latest version (1.2.1) of pdfbox, jempbox and fontbox.
I can attach the patch I used against the 1.4.1 code.
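
For anyone who wants to repeat this, the steps above can be sketched roughly 
as the shell function below. This is only a sketch under assumptions: the 
function name and the two directory arguments are made up for illustration, 
and I am assuming the example webapp is named solr.war and that 'ant dist' 
leaves exactly one war in the dist directory of the patched checkout.

```shell
#!/bin/sh
# Sketch only. replace_release_artifacts is a hypothetical helper name.
#   $1 = patched 1.4.1 source checkout, after 'ant dist' has been run in it
#   $2 = unpacked apache-solr-1.4.1 release
replace_release_artifacts() {
    patched=$1
    release=$2

    # Copy everything generated under dist/ over the release dist/.
    # Files that exist only in the release are left in place, not deleted.
    cp -R "$patched"/dist/. "$release"/dist/ || return 1

    # Install the patched war as the example webapp (assumed name: solr.war).
    # Skipping this step is what produced the NoSuchMethodError.
    for war in "$patched"/dist/*.war; do
        [ -f "$war" ] || return 1
        cp "$war" "$release"/example/webapps/solr.war
        return $?
    done
}

# Usage (paths are placeholders):
#   (cd "$PATCHED" && ant dist)
#   replace_release_artifacts "$PATCHED" "$RELEASE"
#   (cd "$RELEASE/example" && java -jar start.jar)
```

The build and start commands are left as comments so that sourcing the 
snippet does not run anything by itself.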

> Tika no longer properly extracts content in Solr
> ------------------------------------------------
>
>                 Key: SOLR-1902
>                 URL: https://issues.apache.org/jira/browse/SOLR-1902
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 4.0
>
>
> See 
> http://www.lucidimagination.com/search/document/2ca3fe953038a54f/problem_with_pdf_upgrading_cell#22360c8261801f24
> It appears that since the upgrade to Tika 0.7, Tika is now selecting an 
> EmptyParser when uploading docs, which then outputs an empty XHTML 
> representation.  Still, it's strange that the tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
