Hi Lewis,

Talat and I talked about the architecture for sitemap support. We think the problem can be solved within the normal Nutch life cycle; we don't want to build a separate life cycle just for sitemap crawling.
However, I have run into the following problems:

1. If the sitemap file is very large, it cannot be fetched and parsed; it times out. I worked around this temporarily on the parse side by raising the timeout value in nutch-site.xml, and on the fetch side by testing only with small files, but that is not a good solution.

2. As you know, sitemap files have special tags such as "loc", "lastmod", "changefreq" and "priority". My parse plugin already extracts them, and I want to record them in the crawldb, but the Parse object does not support metadata or similar fields; it only carries an Outlink array, which is not enough for recording metadata. I want to record each URL in the sitemap file together with its metadata separately.

I went through all the patches and comments on NUTCH-1465, and it contains solutions to these same problems, but there a new job was created for sitemap crawling. Could you show me a way out? (I have pasted small sketches of both problems below my sign-off.)

Thanks.
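P.S. To make the first problem concrete, this is roughly the kind of temporary override I mean in nutch-site.xml (the values are just what I experimented with, not a recommendation):

<property>
  <name>http.timeout</name>
  <value>60000</value>
  <description>Raised from the 10s default so a large sitemap fetch does not time out.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Disable truncation; otherwise a big sitemap is cut off before it reaches the parser.</description>
</property>

Raising these globally also affects every ordinary page fetch, which is why I consider it only a workaround.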
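And a sketch of the second problem as seen from my plugin (the helper below is made up purely for illustration; it is not my actual code):

import java.net.MalformedURLException;

import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;

public class SitemapEntrySketch {
  // For one <url> entry of the sitemap: the only per-URL structure a Parse
  // can carry is an Outlink, which holds just a URL and an anchor string,
  // so lastmod/changefreq/priority have nowhere natural to go.
  static Outlink toOutlink(String loc, String lastmod, String changefreq,
                           String priority, Metadata parseMeta)
      throws MalformedURLException {
    Outlink link = new Outlink(loc, ""); // lastmod/changefreq/priority are lost here
    // The best I can do today is flatten them into the document-level
    // parse metadata, keyed by URL, but the crawldb update does not know
    // that these values belong to the individual outlinks:
    parseMeta.add("sitemap." + loc + ".lastmod", lastmod);
    parseMeta.add("sitemap." + loc + ".changefreq", changefreq);
    parseMeta.add("sitemap." + loc + ".priority", priority);
    return link;
  }
}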

