Re: Tika parsing

Sebastian Nagel Sun, 04 Oct 2015 12:07:57 -0700

Hi,

this is not necessarily a problem.
It may be the case that Tika does not (yet)
provide parsers for these document types.
Unless you really want to "read" these
documents, it does not matter; it's just
a warning.
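
To judge whether any of the skipped types matter for a given crawl,
one option is to tally the warnings per mime type. The following is
only a minimal sketch: it assumes the log lines contain the warning
text exactly as quoted below, and the helper name
count_unparsed_types is made up for illustration.

```python
import re
from collections import Counter

# Pattern copied from the warning text quoted in this thread.
WARNING = re.compile(r"Can't retrieve Tika parser for mime-type (\S+)")


def count_unparsed_types(lines):
    """Count how often each mime type triggered the Tika warning.

    `lines` is any iterable of log lines (a list or an open file).
    Hypothetical helper, not part of Nutch or Tika.
    """
    counts = Counter()
    for line in lines:
        match = WARNING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to the function (for instance logs/hadoop.log,
which is an assumed location) gives a Counter whose most_common()
shows which mime types are skipped most often.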


Sebastian


On 10/03/2015 07:47 PM, Taichi Ho wrote:
> I keep getting this Tika error when I am using Nutch.
> 
> Can't retrieve Tika parser for mime-type text/css
> Can't retrieve Tika parser for mime-type application/javascript
> Can't retrieve Tika parser for mime-type text/x-php
> Can't retrieve Tika parser for mime-type text/aspdotnet
> 
> I haven't actually done any particular configuration. Everything is default.
> 
> parse-plugins.xml
> <parse-plugins>
> 
>   
>       <mimeType name="*">
>         <plugin id="parse-tika" />
>       </mimeType>
>  
>       <mimeType name="application/rss+xml">
>           <plugin id="parse-tika" />
>           <plugin id="feed" />
>       </mimeType>
> 
>       <mimeType name="application/x-bzip2">
>               
>               <plugin id="parse-zip" />
>       </mimeType>
> 
>       <mimeType name="application/x-gzip">
>               
>               <plugin id="parse-zip" />
>       </mimeType>
> 
>       <mimeType name="application/x-javascript">
>               <plugin id="parse-js" />
>       </mimeType>
> 
>       <mimeType name="application/x-shockwave-flash">
>               <plugin id="parse-swf" />
>       </mimeType>
> 
>       <mimeType name="application/zip">
>               <plugin id="parse-zip" />
>       </mimeType>
> 
>       <mimeType name="text/html">
>               <plugin id="parse-html" />
>       </mimeType>
> 
>         <mimeType name="application/xhtml+xml">
>               <plugin id="parse-html" />
>       </mimeType>
> 
>       <mimeType name="text/xml">
>               <plugin id="parse-tika" />
>               <plugin id="feed" />
>       </mimeType>
> 
>        
> 
>       <mimeType name="application/vnd.nutch.example.cat">
>               <plugin id="parse-ext" />
>       </mimeType>
> 
>       <mimeType name="application/vnd.nutch.example.md5sum">
>               <plugin id="parse-ext" />
>       </mimeType>
> 
>       
>       <aliases>
>               <alias name="parse-tika" 
>                       extension-id="org.apache.nutch.parse.tika.TikaParser" />
>               <alias name="parse-ext" extension-id="ExtParser" />
>               <alias name="parse-html"
>                       extension-id="org.apache.nutch.parse.html.HtmlParser" />
>               <alias name="parse-js" extension-id="JSParser" />
>               <alias name="feed"
>                       extension-id="org.apache.nutch.parse.feed.FeedParser" />
>               <alias name="parse-swf"
>                       extension-id="org.apache.nutch.parse.swf.SWFParser" />
>               <alias name="parse-zip"
>                       extension-id="org.apache.nutch.parse.zip.ZipParser" />
>       </aliases>
>       
> </parse-plugins>
> 
> nutch-site.xml
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description>Regular expression naming plugin directory names to
>     include.  Any plugin not matching this expression is excluded.
>     In any case you need at least include the nutch-extensionpoints plugin. By
>     default Nutch includes crawling just HTML and plain text via HTTP,
>     and basic indexing and search plugins. In order to use HTTPS please enable
>     protocol-httpclient, but be aware of possible intermittent problems with the
>     underlying commons-httpclient library.
>     </description>
>   </property>
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Tika-parsing-tp4232582.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
