I keep getting these Tika errors when using Nutch:

Can't retrieve Tika parser for mime-type text/css
Can't retrieve Tika parser for mime-type application/javascript
Can't retrieve Tika parser for mime-type text/x-php
Can't retrieve Tika parser for mime-type text/aspdotnet

I haven't done any particular configuration; everything is at its defaults.

parse-plugins.xml
<parse-plugins>

  
        <mimeType name="*">
          <plugin id="parse-tika" />
        </mimeType>
 
        <mimeType name="application/rss+xml">
            <plugin id="parse-tika" />
            <plugin id="feed" />
        </mimeType>

        <mimeType name="application/x-bzip2">
                
                <plugin id="parse-zip" />
        </mimeType>

        <mimeType name="application/x-gzip">
                
                <plugin id="parse-zip" />
        </mimeType>

        <mimeType name="application/x-javascript">
                <plugin id="parse-js" />
        </mimeType>

        <mimeType name="application/x-shockwave-flash">
                <plugin id="parse-swf" />
        </mimeType>

        <mimeType name="application/zip">
                <plugin id="parse-zip" />
        </mimeType>

        <mimeType name="text/html">
                <plugin id="parse-html" />
        </mimeType>

        <mimeType name="application/xhtml+xml">
                <plugin id="parse-html" />
        </mimeType>

        <mimeType name="text/xml">
                <plugin id="parse-tika" />
                <plugin id="feed" />
        </mimeType>

       

        <mimeType name="application/vnd.nutch.example.cat">
                <plugin id="parse-ext" />
        </mimeType>

        <mimeType name="application/vnd.nutch.example.md5sum">
                <plugin id="parse-ext" />
        </mimeType>

        
        <aliases>
                <alias name="parse-tika" 
                        extension-id="org.apache.nutch.parse.tika.TikaParser" />
                <alias name="parse-ext" extension-id="ExtParser" />
                <alias name="parse-html"
                        extension-id="org.apache.nutch.parse.html.HtmlParser" />
                <alias name="parse-js" extension-id="JSParser" />
                <alias name="feed"
                        extension-id="org.apache.nutch.parse.feed.FeedParser" />
                <alias name="parse-swf"
                        extension-id="org.apache.nutch.parse.swf.SWFParser" />
                <alias name="parse-zip"
                        extension-id="org.apache.nutch.parse.zip.ZipParser" />
        </aliases>
        
</parse-plugins>

nutch-site.xml

<property>
  <name>plugin.includes</name>
   
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin.
    By default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please
    enable protocol-httpclient, but be aware of possible intermittent
    problems with the underlying commons-httpclient library.
    </description>
  </property>
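
For illustration only, here is an untested sketch of how the JavaScript case might be mapped explicitly. It rests on two assumptions: that the reported type application/javascript is not matched by the existing application/x-javascript entry, and that the parse-js plugin is present in the plugins directory. It does not address text/css, text/x-php or text/aspdotnet.

```xml
<!-- parse-plugins.xml: hypothetical extra mapping (assumption, not part
     of the configuration shown above). It reuses the existing parse-js
     alias for the application/javascript type. -->
<mimeType name="application/javascript">
        <plugin id="parse-js" />
</mimeType>

<!-- nutch-site.xml: the same plugin.includes value as above, with "js"
     added to the parse-(...) group so that parse-js would be loaded. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Without the plugin.includes change the new mapping would have no effect, because the current value only enables parse-html and parse-tika.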




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tika-parsing-tp4232582.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
