Frank McCown wrote:
I'm teaching a search engine course for CS undergrads, and we'd like
to make a contribution to Nutch.  It appears that Nutch does not
support the Sitemap Protocol (NUTCH-158).

http://sitemaps.org/

Correct.


So I wanted to check with you all and see if this is something you
think would make a good addition.  Also, do you think this would be a
good project for a team of 3 undergrad students who need to complete
it within 2-3 weeks?  Being only modestly familiar with the codebase
myself, I don't want to assign a project that would be too difficult
or overwhelming for undergraduates who are newbies and have only been
writing Java code for a few semesters.

I think it would be a welcome addition. The question is more about whether the students are prepared to go through a few rounds of review and polishing the code so that it's fit for committing.

Also you may have heard of the new rel="canonical" attribute which is
now being supported by Google, Yahoo, and Live:

http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

I'd like my students to modify Nutch to support this new attribute as well.

This sounds like a useful addition, too.
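
As a rough illustration of how small the core of this could be, here is a minimal sketch that pulls the canonical URL out of a page. The class name CanonicalLink and its find method are hypothetical, not existing Nutch code, and the regular expressions are a simplification: real support would more likely hook into the HTML parser's DOM and would need to resolve relative URLs and handle rel values that list several tokens.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: finds a rel="canonical" link in raw HTML. */
public final class CanonicalLink {
    // Matches each <link ...> tag, then the rel and href attributes inside it.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL =
        Pattern.compile("\\brel\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    private CanonicalLink() {}

    /** Returns the href of the first link tag whose rel is exactly "canonical", if any. */
    public static Optional<String> find(String html) {
        Matcher tags = LINK_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            Matcher rel = REL.matcher(tag);
            if (rel.find() && rel.group(1).trim().equalsIgnoreCase("canonical")) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return Optional.of(href.group(1).trim());
                }
            }
        }
        return Optional.empty();
    }
}
```

Calling CanonicalLink.find(html) returns an empty Optional when the page declares no canonical URL, so the caller can fall back to the fetched URL.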

One important note: we are in the process of re-thinking the Nutch architecture, so it's likely that after the 1.0 release is out the door we will concentrate on a heavy redesign.

For this reason it would be best if this new functionality could be well separated from existing classes, e.g. in utility classes, or in an extension point that other existing Nutch classes can use.
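
To make the "utility class" idea concrete, here is a minimal sketch of a self-contained Sitemap parser that depends only on the JDK and not on any Nutch class. The name SitemapUrls and its parse method are hypothetical. It only reads the loc elements of a urlset document, and ignores sitemap index files, lastmod, changefreq, priority and gzip compression, all of which a real implementation would have to consider.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical stand-alone utility, not part of Nutch: lists the URLs in a sitemap. */
public final class SitemapUrls {
    // Namespace defined by the Sitemap Protocol 0.9 schema.
    private static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    private SitemapUrls() {}

    /** Parses sitemap XML and returns the text of every loc element, in document order. */
    public static List<String> parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since sitemaps come from untrusted sites.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }
}
```

Because the class takes a String and returns a List of Strings, whatever part of Nutch ends up feeding the URLs into the crawl can change during the redesign without touching the parser.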



After I get some feedback, I'll submit a request to JIRA.  I was
wondering though, would it be better to submit it as an issue for 0.9,
1.0, or 1.1?

1.1. We are putting the final touches on 1.0, and new development will happen only on the trunk.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
