At 12:46 PM -0400 9/18/09, Paul Tomblin wrote:
Nutch is, I think, doing the right thing by not
crawling it, but I can't convince her of this because she's convinced that
DP is commercial and Nutch is "only" Open Source, so obviously DP is right.

Just the opposite ... the commercial product is doing it *wrong* (not respecting robots.txt) while the open source product is doing it *right* (respecting the file).

The client is ornery and is doing something patently against the wishes of the owner(s) of the content, as expressed in the robots.txt file. If she does have permission, get the owner(s) of the content to allow your Nutch agent name in their robots.txt file(s).
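
For illustration, here is a minimal sketch in Python of what "respecting the file" means, using the standard library's urllib.robotparser. The agent name MyNutchCrawler, the example.com URL, and the robots.txt contents are all made up for the example; a real site's file and your real agent name will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the owner blocks all crawlers by default,
# but explicitly allows one named agent (an empty Disallow means "allow all").
ROBOTS_TXT = """\
User-agent: MyNutchCrawler
Disallow:

User-agent: *
Disallow: /
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.com/page.html"  # placeholder URL
    for agent in ("MyNutchCrawler", "SomeOtherBot"):
        print(agent, can_fetch(agent, url))
```

A crawler that honors robots.txt makes this kind of check before every fetch. One that ignores the file skips it, which appears to be what the commercial product is doing.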

I know how few and far between paying clients are these days, but personally -- under the circumstances you've described -- I think I'd walk away from this project.

\dmc

PS: The robots.txt file shouldn't have any mention of a sitemap, except possibly a "Sitemap:" line giving its URL.

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
   David M. Cole                                            d...@colegroup.com
   Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
   Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
