Hi Kenneth,
Thanks for following up! Since there is almost no Javadoc available for
the sitemap classes and many of the main job classes, I was mainly using
the GSOC project page and the lifecycle PDF as references. The Nutch 2
lifecycle PDF says that sitemap detection happens during injection, but
I found that it actually happens during fetching, via the -stmDetect
flag. Reading the code also confirms that fetch is the only job that
uses the crawler-commons sitemap features. In addition, the sitemap
feature wiki page contains only a link to the GSOC project for Nutch
2.x, which is the version I'm using.
Specifically, I'm running Nutch 2.x on Ubuntu 16.04, after failing to
get it working on Windows (Hadoop binary-related problems, despite
extensive troubleshooting). Let me know if there's any additional
information I can provide.
I completely understand that documentation for a community project can
be difficult to maintain, and I'd be more than happy to add or fix some
where I can. Right now, though, I'm still trying to verify or falsify
some of the claims in the existing documentation...
Thanks!
Michael
On 08/01/2017 05:30 PM, kenneth mcfarland wrote:
Can you please be more specific about your environment and what you
have found to be out of date?
On Aug 1, 2017 5:28 PM, "Michael Chen" <[email protected]> wrote:
Problem resolved. The crawl script and the web documentation are out
of date; the nutch script itself works fine.
It might be a good idea to update the sitemap-related documentation
at some point... right now it takes quite a bit of speculation and
experimentation.
Thanks!
Michael
On 07/31/2017 12:21 PM, Michael Chen wrote:
Dear fellow Nutch developers,
I've been trying to use the Nutch 2 sitemap function to crawl and
index all pages listed in the sitemap indices. Integration with the
crawler-commons sitemap tools seems to exist only in the 2.x branch.
But after I got it working with HBase 1.2.3, it didn't fetch, parse,
or index the sitemap indices and sitemaps at all.
I also looked into the code a bit, and everything seems to make
sense, except that I couldn't trace the data flow beyond
ToolRunner.run() in the FetchReducer. I'm testing on Linux with
the "crawl" script in /bin, so I'm not sure how I can debug this.
Please let me know if there's any further information I can provide
to help troubleshoot this issue. Thanks in advance!
Best regards,
Michael