At 12:46 PM -0400 9/18/09, Paul Tomblin wrote:
> Nutch is, I think, doing the right thing by not
> crawling it, but I can't convince her of this because she's convinced that
> DP is commercial and Nutch is "only" Open Source, so obviously DP is right.
Just the opposite ... the commercial product is doing it *wrong* (not
respecting robots.txt) while the open source product is doing it
*right* (respecting the file).
The client is ornery and is doing something patently against the
wishes (expressed in the robots.txt file) of the owner(s) of the
content (unless she has permission, in which case get the owner[s] of
the content to include your Nutch agent name in their robots.txt
file[s]).
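For what it's worth, here is a rough sketch of what "include your Nutch agent name" could look like, checked with Python's standard urllib.robotparser. The agent name "MyNutchCrawler", the example.com URLs and the rules themselves are all made up for illustration; the real file would be whatever the content owner chooses to publish.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the owner keeps /private/ closed to crawlers
# in general, but grants a crawler named "MyNutchCrawler" (an assumed
# name, not a Nutch default) access to everything.
ROBOTS_TXT = """\
User-agent: MyNutchCrawler
Allow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The named agent has been given permission; any other crawler has not.
named = is_allowed(ROBOTS_TXT, "MyNutchCrawler", "http://example.com/private/page.html")
other = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/private/page.html")
```

A polite crawler runs a check like this before every fetch, which is why a Disallow rule stops it until the owner adds an explicit entry for its agent name.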
I know how few and far between paying clients are these days, but
personally -- under the circumstances you've described -- I think
I'd walk away from this project.
\dmc
PS: The robots.txt file shouldn't have any mention of a sitemap,
except possibly a line giving the sitemap's URL.
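As a small illustration of that PS, the only sitemap-related thing a robots.txt normally carries is a "Sitemap:" line pointing at the sitemap's URL. The sketch below assumes Python 3.8 or later (for site_maps()) and uses a made-up example.com file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt whose only sitemap reference is the URL line.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /private/

Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
```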
--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole d...@colegroup.com
Editor & Publisher, NewsInc. <http://newsinc.net> V: (650) 557-2993
Consultant: The Cole Group <http://colegroup.com/> F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+