Problem resolved. The crawl script and the web documentation are out of
date. The Nutch script works fine.
It might be a good idea to update the sitemap-related documentation at
some point; it takes quite a bit of speculation and experimentation
right now.
Thanks!
Michael
On 07/31/2017 12:21 PM, Michael Chen wrote:
Dear fellow Nutch developers,
I've been trying to use the Nutch 2 sitemap function to crawl and index
all pages listed in the sitemap indices. It seems that integration with
the CommonCrawler sitemap tools only exists in the 2.x branch. But after
I got it to work with HBase 1.2.3, it didn't fetch, parse, or index the
sitemap indices and sitemaps at all.
I also looked into the code a bit and everything seems to make sense,
except that I couldn't trace the data flow beyond ToolRunner.run()
in the FetchReducer. I'm testing it on Linux with the "crawl" script
in /bin, so I'm not sure how I can debug this. Please let me know
if there's any further information I can provide to help
troubleshoot this issue. Thanks in advance!
Best regards,
Michael