[jira] Commented: (SOLR-2116) TikaEntityProcessor does not find parser by default

Chris A. Mattmann (JIRA) Mon, 03 Jan 2011 18:04:11 -0800

    [ https://issues.apache.org/jira/browse/SOLR-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977072#action_12977072 ]


Chris A. Mattmann commented on SOLR-2116:
-----------------------------------------

Hey Lance,

bq. Speaking of Tika, have you ever seen a tikaconfig file? I can't find one 
anywhere on the web or in the Tika source

In later versions of Tika (I think since 0.7) we've gone to an all Service 
Provider Interface (SPI) mechanism for Parser config and resource loading, 
obviating the need for a tika-config.xml file:

https://issues.apache.org/jira/browse/TIKA-317

However, you still have the option of specifying and using one. See:

http://svn.apache.org/repos/asf/tika/tags/0.8/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java

You can find an example of the XML-based Tika config here:

http://svn.apache.org/repos/asf/tika/tags/0.6/tika-core/src/main/resources/org/apache/tika/
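
For illustration, the 0.6-era XML config looked roughly like the sketch 
below. This is written from memory rather than copied from the file above, 
so treat the element names and the mimeTypeRepository resource path as 
assumptions and check them against that file. It maps the PDF MIME type to 
the parser class named in the issue description:

{code:xml}
<properties>
  <!-- MIME type definitions; resource path assumed from the 0.6 layout -->
  <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"
                      magic="false"/>
  <parsers>
    <!-- one entry per parser class, listing the MIME types it handles -->
    <parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
{code}

With the SPI mechanism, a file like this should only be needed when you 
want to override the default set of parsers.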

Part of this is also due to ParseContext, which was also introduced as a 
configuration mechanism. See:

https://issues.apache.org/jira/browse/TIKA-275

Cheers,
Chris




> TikaEntityProcessor does not find parser by default
> ---------------------------------------------------
>
>                 Key: SOLR-2116
>                 URL: https://issues.apache.org/jira/browse/SOLR-2116
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika 
> extraction)
>    Affects Versions: 3.1, 4.0
>            Reporter: Lance Norskog
>         Attachments: pdflist-data-config.xml, pdflist.xml, SOLR-2116.patch
>
>
> The TikaEntityProcessor does not find the correct document parser by default.
> This is in a two-level DIH config file. I have attached 
> pdflist-data-config.xml and pdflist.xml, the XML file that supplies the 
> list of files. To test this, you will need the current 3.x branch or 4.0 
> trunk.
> # Set up a Tika-enabled Solr.
> # Copy any PDF file to /tmp/testfile.pdf.
> # Copy pdflist-data-config.xml to your solr/conf.
> # Add this snippet to your solrconfig.xml:
> {code:xml}
> <requestHandler name="/pdflist"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">pdflist-data-config.xml</str>
>   </lst>
> </requestHandler>
> {code}
> [http://localhost:8983/solr/pdflist?command=full-import] will make one 
> document with the id and text fields populated. If you remove this line:
> {code}
>  parser="org.apache.tika.parser.pdf.PDFParser"
> {code}
> from the TikaEntityProcessor entity, the parser will not be found and you 
> will get a document with the "id" field and nothing else.

