Hi Kenneth,
Thanks for following up! Since there is almost no Javadoc available for
the sitemap classes and many of the main job classes, I was mainly using
the GSOC project page and the lifecycle PDF as references. The Nutch 2
lifecycle PDF says that sitemap detection happens during injection, but
I found that it actually happens during fetching, via the -stmDetect
flag. Reading the code also confirms that fetch is the only job that
uses the crawler-commons sitemap features. In addition, the sitemap
feature wiki page contains only a link to the GSOC project for Nutch
2.x, which is the version I'm using.
Specifically, I'm running Nutch 2.x on Ubuntu 16.04, after failing to
get it working on Windows (Hadoop binary-related problems, despite
extensive troubleshooting). Let me know if there's any additional
information I can provide.
I completely understand that documentation for a community project can
be difficult to maintain, and I'd be more than happy to add or fix some
where I can. Right now, though, I'm still trying to verify or falsify
some of the claims in the existing documentation...
Thanks!
Michael
On 08/01/2017 05:30 PM, kenneth mcfarland wrote:
Can you please be more specific about your environment and what you
have found to be out of date?
On Aug 1, 2017 5:28 PM, "Michael Chen" <[email protected]> wrote:
Problem resolved. The crawl script and the web documentation are out
of date; the nutch script itself works fine.
It might be a good idea to update the sitemap-related documentation
at some point... right now it takes quite a bit of speculation and
experimentation.
Thanks!
Michael
On 07/31/2017 12:21 PM, Michael Chen wrote:
Dear fellow Nutch developers,
I've been trying to use the Nutch 2 sitemap function to crawl and
index all pages listed in the sitemap indices. Integration with the
crawler-commons sitemap tools seems to exist only in the 2.x branch.
But after I got it working with HBase 1.2.3, it didn't fetch, parse,
or index the sitemap indices and sitemaps at all.
I also looked into the code a bit, and everything seems to make
sense, except that I couldn't trace the data flow beyond
ToolRunner.run() in the FetchReducer. I'm testing on Linux with
the "crawl" script in /bin, so I'm not sure how I can debug this.
Please let me know if there's any further information I can provide
to help troubleshoot this issue. Thanks in advance!
Best regards,
Michael