Re: Nutch not recognizing html pages/images retrieved via php

Julien Nioche Mon, 05 Oct 2015 01:21:51 -0700

Hi

What happens is that parse-tika is used by default, but it doesn't know what
to do with the text/x-php mime type.


You can edit parse-plugins.xml
<https://github.com/apache/nutch/blob/trunk/conf/parse-plugins.xml> and add

<mimeType name="text/x-php">
<plugin id="parse-html" />
</mimeType>


to map the mime type to the html parser. Obviously you'll need parse-html
to be active, i.e. included in plugin.includes.
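
For what it's worth, "active" means that parse-html is matched by the
plugin.includes regex in conf/nutch-site.xml. A rough sketch, assuming a
setup close to the stock one (the value below is only illustrative; keep
your own list of plugins and just make sure parse-html is part of it):

```xml
<!-- conf/nutch-site.xml (illustrative value, adapt to your own plugin list) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The parse-(html|tika) part is what enables both parsers; the mapping in
parse-plugins.xml then decides which of them handles a given mime type.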

HTH

Julien



On 4 October 2015 at 03:01, Girish Rao wrote:

> Hi,
>
> I am running a crawl on a website that serves pages and images via php.
> Nutch doesn’t seem to crawl these pages.
>
> I see the below in the hadoop.log
> 2015-10-03 12:48:31,091 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> text/x-php, but they are not mapped to it  in the parse-plugins.xml file
> 2015-10-03 12:48:31,712 ERROR tika.TikaParser - Can't retrieve Tika parser
> for mime-type text/x-php
> 2015-10-03 12:48:31,713 WARN  parse.ParseSegment - Error parsing:
> http://www.arguntrader.com/ucp.php?mode=login: failed(2,0): Can't
> retrieve Tika parser for mime-type text/x-php
>
> Can anyone help with identifying what is to be done to crawl a site which
> serves pages via php?
>
> Regards
> Girish




-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
