I'm teaching a search engine course for CS undergrads, and we'd like to make a contribution to Nutch. It appears that Nutch does not support the Sitemap Protocol (NUTCH-158).
http://sitemaps.org/

So I wanted to check with you all and see if this is something you think would make a good addition. Also, do you think this would be a good project for a team of 3 undergrad students who need to complete it within 2-3 weeks? Being only modestly familiar with the codebase myself, I don't want to assign a project that would be too difficult or overwhelming for undergraduates who are newbies and have only been writing Java code for a few semesters.

Also you may have heard of the new rel="canonical" attribute which is now being supported by Google, Yahoo, and Live:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

I'd like my students to modify Nutch to support this new attribute as well.

After I get some feedback, I'll submit a request to JIRA. I was wondering though, would it be better to submit it as an issue for 0.9, 1.0, or 1.1?

Thanks,
Frank

--
Frank McCown, Ph.D.
Assistant Professor of Computer Science
Harding University
http://www.harding.edu/fmccown/