Nutch not recognizing html pages/images retrieved via php

Girish Rao Sat, 03 Oct 2015 19:01:58 -0700

Hi,

I am running a crawl on a website that serves pages and images via PHP. Nutch 
doesn’t seem to crawl these pages.


I see the following in hadoop.log:
2015-10-03 12:48:31,091 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type text/x-php, but they 
are not mapped to it  in the parse-plugins.xml file
2015-10-03 12:48:31,712 ERROR tika.TikaParser - Can't retrieve Tika parser for 
mime-type text/x-php
2015-10-03 12:48:31,713 WARN  parse.ParseSegment - Error parsing: 
http://www.arguntrader.com/ucp.php?mode=login: failed(2,0): Can't retrieve Tika 
parser for mime-type text/x-php

Can anyone help identify what needs to be done to crawl a site that 
serves pages via PHP?

Regards
Girish
