GSoC - Sitemap support - final evaluation

Cihad Guzel Thu, 13 Aug 2015 13:04:47 -0700

Hi all.

As you know, I am working on NUTCH-1741 for GSoC 2015. I have very little
time left before the final evaluation of the GSoC program, so I want to
briefly describe the work so far.


My goal is to add sitemap support to Nutch. I have almost completed my
work. I have committed my code to my GitHub account [1] and attached the
patch file to the issue [2]. The features developed at this stage are as
follows:

+ Sitemap files are crawled (inject, generate, fetch and parse).
+ If a host has any sitemap files, they are detected.
+ If desired, only sitemaps can be crawled, or only other (non-sitemap)
URLs can be crawled.
+ It is activated with just one parameter (-sitemap).
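
To illustrate what detecting a host's sitemap files can involve, here is a
minimal sketch in Java. It is not the code from the patch, and this message
does not say how the patch performs detection. The sketch assumes the common
convention that a site lists its sitemaps in robots.txt using "Sitemap:"
lines. The class name SitemapHints and the method fromRobotsTxt are
hypothetical names chosen for this example and are not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or the patch.
public class SitemapHints {

    private static final String DIRECTIVE = "sitemap:";

    /** Returns the URLs given in "Sitemap:" lines of a robots.txt body. */
    public static List<String> fromRobotsTxt(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Match the directive name case-insensitively at the line start.
            if (trimmed.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String url = trimmed.substring(DIRECTIVE.length()).trim();
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }

    public static void main(String[] args) {
        // example.com is a placeholder host.
        String robots = "User-agent: *\n"
                + "Disallow: /private/\n"
                + "Sitemap: https://example.com/sitemap.xml\n";
        System.out.println(fromRobotsTxt(robots));
    }
}
```

A crawler could then treat the returned URLs as sitemap seeds, separate from
ordinary page URLs.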

Please see the wiki [3] and the issue [2] for more information.

Thanks to my mentors (Lewis & Talat) and to the Nutch community.

[1] - https://github.com/cguzel/nutch-sitemapCrawler
[2] - https://issues.apache.org/jira/browse/NUTCH-1741
[3] - https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler

--
Kind regards
Cihad Guzel
