Please know the inquiry is simply to understand how others and I can
document the code better. Thank you for your response.

Kenneth

On Aug 1, 2017 5:45 PM, "Michael Chen" <yiningchen2...@u.northwestern.edu>
wrote:

> Hi Kenneth,
>
> Thanks for following up! Since there is almost no Javadoc available for
> the sitemap classes or many of the main job classes, I was mainly using the
> GSOC project page and the lifecycle PDF as references. The Nutch 2
> lifecycle PDF says that sitemap detection is done during injection, but I
> found it actually happens during fetching, via the -stmDetect flag.
> Reading the code also confirms that fetch is the only process that uses
> the CommonCrawler sitemap features. In addition, the sitemap feature wiki
> page contains only a link to the GSOC project for Nutch 2.x, which is what
> I'm using.
>
> Specifically, I'm running Nutch 2.x on Ubuntu 16.04 after failing to get
> it working on Windows (Hadoop binary related problems; I did extensive
> troubleshooting). Let me know if there's any additional information I can
> provide.
>
> I completely understand that documentation for a community project can be
> difficult to maintain, and I'll be more than happy to add or fix some if I
> can. But right now I'm still trying to verify or falsify some of the
> claims in the documentation...
>
> Thanks!
>
> Michael
>
> On 08/01/2017 05:30 PM, kenneth mcfarland wrote:
>
> Can you please be more specific about your environment and what you have
> found to be out of date?
>
> On Aug 1, 2017 5:28 PM, "Michael Chen" <yiningchen2...@u.northwestern.edu>
> wrote:
>
>> Problem resolved. The crawl script and web documentation are out of date;
>> the Nutch script itself works fine.
>>
>> It might be a good idea to update the sitemap-related documentation at
>> some point... it takes quite a bit of speculation and experimentation
>> right now...
>>
>> Thanks!
>>
>> Michael
>>
>>
>> On 07/31/2017 12:21 PM, Michael Chen wrote:
>>
>>> Dear fellow Nutch developers,
>>>
>>> I've been trying to use the Nutch 2 sitemap feature to crawl and index
>>> all pages listed in the sitemap indices. It seems that integration with
>>> the CommonCrawler sitemap tools only exists in the 2.x branch. But after
>>> I got it to work with HBase 1.2.3, it didn't fetch, parse, or index the
>>> sitemap indices and sitemaps at all.
>>>
>>> I also looked into the code a bit and everything seems to make sense,
>>> except I couldn't trace the data flow beyond ToolRunner.run() in the
>>> FetchReducer. I'm testing it on Linux with the "crawl" script in /bin,
>>> so I'm not sure how I can debug this. Please let me know if there's any
>>> further information I can provide to help troubleshoot this issue.
>>> Thanks in advance!
>>>
>>> Best regards,
>>>
>>> Michael
>>>
>>>
>>>
>>
>