Please know the inquiry is simply to understand how I and others can document the code better. Thank you for your response.
Kenneth

On Aug 1, 2017 5:45 PM, "Michael Chen" <yiningchen2...@u.northwestern.edu> wrote:

> Hi Kenneth,
>
> Thanks for following up! Besides the fact that there is almost no Javadoc
> available for the sitemap classes and many of the main job classes, I was
> mainly using the GSoC project page and the lifecycle PDF as references.
> The Nutch 2 lifecycle PDF says that sitemap detection is done during
> injection, but I found that it actually happens during fetching, via the
> -stmDetect flag. Looking at the code also confirms that fetch is the only
> process that uses the CommonCrawler sitemap features. In addition, the
> sitemap feature wiki page contains only a link to the GSoC project for
> Nutch 2.x, which is what I'm using.
>
> Specifically, I'm running Nutch 2.x on Ubuntu 16.04 after failing to get
> it working on Windows (Hadoop binary file related problems; I did
> extensive troubleshooting). Let me know if there's any additional
> information I can provide.
>
> I completely understand that documentation for a community project can be
> difficult to maintain, and I'll be more than happy to add or fix some if
> I can. But right now I'm still trying to verify or falsify some of the
> claims in the documentation.
>
> Thanks!
>
> Michael
>
> On 08/01/2017 05:30 PM, kenneth mcfarland wrote:
>
> Can you please be more specific about your environment and what you have
> found to be out of date?
>
> On Aug 1, 2017 5:28 PM, "Michael Chen" <yiningchen2...@u.northwestern.edu>
> wrote:
>
>> Problem resolved. The crawl script and the web documentation are out of
>> date; the nutch script works fine.
>>
>> It might be a good idea to update the sitemap-related documentation at
>> some point; right now it takes quite a bit of speculation and
>> experimentation.
>>
>> Thanks!
>>
>> Michael
>>
>> On 07/31/2017 12:21 PM, Michael Chen wrote:
>>
>>> Dear fellow Nutch developers,
>>>
>>> I've been trying to use the Nutch 2 sitemap feature to crawl and index
>>> all pages listed in the sitemap indices. It seems that the integration
>>> with the CommonCrawler sitemap tools exists only in the 2.x branch. But
>>> after I got it working with HBase 1.2.3, it didn't fetch, parse, or
>>> index the sitemap indices and sitemaps at all.
>>>
>>> I also looked into the code a bit, and everything seems to make sense,
>>> except that I couldn't trace the data flow beyond ToolRunner.run() in
>>> the FetchReducer. I'm testing on Linux with the "crawl" script in /bin,
>>> so I'm not sure how I can debug this. Please let me know if there's any
>>> further information I can provide to help troubleshoot this issue.
>>> Thanks in advance!
>>>
>>> Best regards,
>>>
>>> Michael
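For anyone landing on this thread later, here is a rough sketch of running the Nutch 2.x crawl cycle step by step with the nutch script instead of the crawl script, passing the -stmDetect flag to the fetch job as described above. This is only an illustration: the seed directory, crawl ID, batch handling, and topN value are assumptions, not taken from the thread, so adjust them to your own setup and check `bin/nutch fetch` usage output for the exact argument order in your version.

```shell
# Sketch of a manual Nutch 2.x cycle (paths and IDs are hypothetical).
# Inject seed URLs into the web table (HBase backend in this thread's setup):
bin/nutch inject urls/ -crawlId demo

# Generate a batch of URLs to fetch:
bin/nutch generate -topN 50 -crawlId demo

# Fetch the generated batch; per this thread, sitemap detection happens
# here (not at inject time), enabled by the -stmDetect flag:
bin/nutch fetch -all -crawlId demo -stmDetect

# Parse and update the database as usual:
bin/nutch parse -all -crawlId demo
bin/nutch updatedb -all -crawlId demo
```

Running the steps individually like this also makes the FetchReducer easier to debug than the all-in-one crawl script, since each job's logs can be inspected in isolation.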