Hi all,

I have been working for 3 weeks. In the meantime I have been preparing a simple weekly report on the Nutch wiki [1]. You can follow my working progress on the wiki page [2], and you can also follow my code development in my GitHub repo [3].
I want to describe my weekly work. Sitemap URLs are crawled in two ways.

The first one: you can inject the sitemap URLs from a seed file using InjectorJob. The fetched sitemaps are then parsed by the sitemap parser plugin, and the URLs obtained by parsing are written to the outlink column in the db; these entries are marked as sitemaps.

The second one: the sitemap URLs are detected from the robots.txt file. These URLs are written to the sitemap (stm) column in the db and are also marked as sitemaps.

I have attached a graph to the JIRA issue [4]; you can see the life cycle of the sitemap parser there [5]. A rough sketch of the parsing steps follows below.
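To make the flow a bit more concrete, here is a minimal, self-contained sketch of the two parsing steps, assuming the plugin builds on the crawler-commons library (Nutch already uses its robots.txt parser): SimpleRobotRulesParser exposes the "Sitemap:" lines of robots.txt via getSitemaps(), and SiteMapParser turns a fetched sitemap into the URLs that become outlinks. The SitemapFlowSketch class, the local file names and the example.com URLs are placeholders for illustration only; the real logic lives in the plugin and the InjectorJob/fetch cycle, not in a standalone main().

// Sketch only, not the plugin code: robots.txt sitemap detection and
// sitemap parsing with crawler-commons.
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapFlowSketch {

  public static void main(String[] args) throws Exception {
    // Second way: detect sitemap URLs announced in robots.txt ("Sitemap:" lines).
    byte[] robotsBytes = Files.readAllBytes(Paths.get("robots.txt"));
    SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();
    BaseRobotRules rules = robotsParser.parseContent(
        "http://example.com/robots.txt", robotsBytes, "text/plain", "nutch-test");
    for (String sitemapUrl : rules.getSitemaps()) {
      // in the crawler these would go to the sitemap (stm) column, marked as sitemaps
      System.out.println("sitemap from robots.txt: " + sitemapUrl);
    }

    // Both ways end here: parse a fetched sitemap and collect its URLs.
    byte[] sitemapBytes = Files.readAllBytes(Paths.get("sitemap.xml"));
    SiteMapParser sitemapParser = new SiteMapParser();
    AbstractSiteMap parsed = sitemapParser.parseSiteMap(
        sitemapBytes, new URL("http://example.com/sitemap.xml"));
    if (parsed instanceof SiteMap) {
      for (SiteMapURL u : ((SiteMap) parsed).getSiteMapUrls()) {
        // in the crawler these would go to the outlink column, marked as sitemaps
        System.out.println("outlink: " + u.getUrl());
      }
    }
  }
}

(A sitemap index would come back as a SiteMapIndex instead of a SiteMap, and its child sitemaps would have to be fetched and parsed the same way.)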
I am waiting for your opinions, and I will shape my work according to your ideas. Thanks...

[1] https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport
[2] https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler
[3] https://github.com/cguzel/nutch-sitemapCrawler
[4] https://issues.apache.org/jira/browse/NUTCH-1741
[5] https://issues.apache.org/jira/secure/attachment/12707721/SitemapCrawlerLifeCycle.pdf

--
Kind Regards
Cihad
