I keep getting these Tika errors when running Nutch:
Can't retrieve Tika parser for mime-type text/css
Can't retrieve Tika parser for mime-type application/javascript
Can't retrieve Tika parser for mime-type text/x-php
Can't retrieve Tika parser for mime-type text/aspdotnet
I haven't changed any configuration for this; everything is at its defaults.
parse-plugins.xml
<parse-plugins>
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="application/rss+xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
  </mimeType>
  <mimeType name="application/x-bzip2">
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="application/x-gzip">
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="application/x-javascript">
    <plugin id="parse-js" />
  </mimeType>
  <mimeType name="application/x-shockwave-flash">
    <plugin id="parse-swf" />
  </mimeType>
  <mimeType name="application/zip">
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="text/xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
  </mimeType>
  <mimeType name="application/vnd.nutch.example.cat">
    <plugin id="parse-ext" />
  </mimeType>
  <mimeType name="application/vnd.nutch.example.md5sum">
    <plugin id="parse-ext" />
  </mimeType>
  <aliases>
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
    <alias name="parse-ext" extension-id="ExtParser" />
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-js" extension-id="JSParser" />
    <alias name="feed"
           extension-id="org.apache.nutch.parse.feed.FeedParser" />
    <alias name="parse-swf"
           extension-id="org.apache.nutch.parse.swf.SWFParser" />
    <alias name="parse-zip"
           extension-id="org.apache.nutch.parse.zip.ZipParser" />
  </aliases>
</parse-plugins>
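To see how the four failing types are routed by this file, here is a rough Python sketch of the lookup as I understand it. It assumes (this is my reading, not something I have confirmed in the Nutch source) that a mime type is matched exactly first and otherwise falls back to the `*` entry. The XML is a trimmed copy of the file above, and `load_mappings` and `plugins_for` are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the parse-plugins.xml above (only a few mappings kept).
PARSE_PLUGINS = """
<parse-plugins>
  <mimeType name="*"><plugin id="parse-tika" /></mimeType>
  <mimeType name="application/x-javascript"><plugin id="parse-js" /></mimeType>
  <mimeType name="text/html"><plugin id="parse-html" /></mimeType>
</parse-plugins>
"""


def load_mappings(xml_text):
    """Return {mime type: [plugin ids in declared order]}."""
    root = ET.fromstring(xml_text)
    return {
        mime.get("name"): [plugin.get("id") for plugin in mime.findall("plugin")]
        for mime in root.findall("mimeType")
    }


MAPPINGS = load_mappings(PARSE_PLUGINS)


def plugins_for(mime_type):
    # ASSUMPTION: exact match first, otherwise the "*" wildcard entry.
    return MAPPINGS.get(mime_type, MAPPINGS.get("*", []))


for mime in ("text/css", "application/javascript", "text/x-php", "text/aspdotnet"):
    print(mime, "->", plugins_for(mime))
```

Under that assumption none of the four types has its own entry (the file maps `application/x-javascript`, not `application/javascript`), so they would all end up at parse-tika through the wildcard.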
nutch-site.xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please
  enable protocol-httpclient, but be aware of possible intermittent
  problems with the underlying commons-httpclient library.
  </description>
</property>
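As a cross-check between the two files, this small Python sketch tests which of the plugin ids referenced in parse-plugins.xml are matched by the `plugin.includes` value above. It assumes the expression has to match the whole plugin name, which is my understanding of the property description rather than something I have verified; `is_included` is a helper name invented here.

```python
import re

# Value of plugin.includes copied from the nutch-site.xml above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    r"|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)"
)

# Plugin ids referenced by the mimeType entries in parse-plugins.xml.
REFERENCED = ["parse-tika", "feed", "parse-zip", "parse-js",
              "parse-swf", "parse-html", "parse-ext"]


def is_included(plugin_id):
    # ASSUMPTION: the regex must match the full plugin name, not a prefix.
    return re.fullmatch(PLUGIN_INCLUDES, plugin_id) is not None


excluded = [pid for pid in REFERENCED if not is_included(pid)]
print("referenced but not included:", excluded)
```

If that assumption holds, only parse-html and parse-tika from that list are enabled, so the parse-js, parse-zip, parse-swf, parse-ext and feed mappings would have no loaded plugin behind them.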
--
View this message in context:
http://lucene.472066.n3.nabble.com/Tika-parsing-tp4232582.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.