Hi,

What happens is that parse-tika is used by default, but it doesn't know what to do with that MIME type.
You can edit parse-plugins.xml <https://github.com/apache/nutch/blob/trunk/conf/parse-plugins.xml> and add

    <mimeType name="text/x-php">
      <plugin id="parse-html" />
    </mimeType>

to map the MIME type to the HTML parser. Obviously you'll need parse-html to be active.

HTH

Julien

On 4 October 2015 at 03:01, Girish Rao <[email protected]> wrote:

> Hi,
>
> I am running a crawl on a website that serves pages and images via php.
> Nutch doesn’t seem to crawl these pages.
>
> I see the below in the hadoop.log
> 2015-10-03 12:48:31,091 INFO parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> text/x-php, but they are not mapped to it in the parse-plugins.xml file
> 2015-10-03 12:48:31,712 ERROR tika.TikaParser - Can't retrieve Tika parser
> for mime-type text/x-php
> 2015-10-03 12:48:31,713 WARN parse.ParseSegment - Error parsing:
> http://www.arguntrader.com/ucp.php?mode=login: failed(2,0): Can't
> retrieve Tika parser for mime-type text/x-php
>
> Can anyone help with identifying what is to be done to crawl a site which
> serves pages via php?
>
> Regards
> Girish

--
*Open Source Solutions for Text Engineering*
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
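For reference, the reply's remark that parse-html needs to be active refers to the plugin.includes property named in the log message. Below is a minimal sketch of what an override in conf/nutch-site.xml might look like. The property value shown is an assumption for illustration, not taken from this thread: it resembles a typical Nutch 1.x default, and your installation's list may differ. The only part that matters here is that parse-html (here via parse-(html|tika)) appears in the expression.

```xml
<!-- conf/nutch-site.xml: illustrative override, assuming a Nutch 1.x layout -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Example value only: keep the plugins your setup already lists,
         and make sure parse-html is matched by the regular expression. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

With parse-html enabled and the text/x-php mapping added to parse-plugins.xml, pages reported with that content type should be handed to the HTML parser instead of Tika.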

