At 05:04 PM 3/13/2013, Dan McCullough wrote:
>Web bots can ignore the robots.txt file, most scrapers would.
and at 05:06 PM 3/13/2013, Marc Guay wrote:
>These don't sound like robots that would respect a txt file to me.
Dan and Marc are correct. Although I used the terms "spiders" and "pirates," I
believe that the correct term, as employed by Dan, is "scrapers," and that
term might be applied to either the robot or the site which displays its
results. One blogger has called scrapers "the arterial plaque of the Internet."
I need to implement a solution that allows humans to access my files but
prevents scrapers from accessing them. I will undoubtedly have to implement
some type of challenge-and-response in the system (such as a captcha), but as
long as those files are stored under the web root, a scraper that has a valid
URL can probably grab them directly, bypassing any challenge. That is part of
what the "public" in public_html means.
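One possible shape for this, as a minimal and untested-in-production sketch: move the files to a directory outside public_html, so that no URL maps to them directly, and hand them out through a small PHP script that only serves a session which has already passed the challenge. The function names, the storage directory, and the 'human_verified' session key are all hypothetical, and the captcha step itself is assumed to live elsewhere and to set that key.

```php
<?php
// Hypothetical download gate. Every name here (resolve_download,
// serve_download, the 'human_verified' key) is an illustrative assumption.

/**
 * Map a requested file name to a real path inside $storageDir.
 * Returns null if it does not resolve to a regular file there.
 */
function resolve_download($storageDir, $requested)
{
    $base = realpath($storageDir);
    if ($base === false) {
        return null;
    }
    // basename() discards directory components such as "../".
    $path = realpath($base . DIRECTORY_SEPARATOR . basename($requested));
    if ($path === false || !is_file($path)) {
        return null;
    }
    // Defensive check: after symlinks are resolved, the file must
    // still sit under the storage directory.
    if (strpos($path, $base . DIRECTORY_SEPARATOR) !== 0) {
        return null;
    }
    return $path;
}

/**
 * Send the file only to a session that has passed the challenge.
 * Returns the HTTP status code it used.
 */
function serve_download($storageDir, $requested, array $session)
{
    if (empty($session['human_verified'])) {
        http_response_code(403);   // challenge not passed yet
        return 403;
    }
    $path = resolve_download($storageDir, $requested);
    if ($path === null) {
        http_response_code(404);
        return 404;
    }
    $name = str_replace('"', '', basename($path));
    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . $name . '"');
    header('Content-Length: ' . filesize($path));
    readfile($path);
    return 200;
}
```

A download script under public_html would then call session_start() and pass $_SESSION, the requested name from $_GET, and the private directory (for example a sibling of public_html) to serve_download(). This does not stop a scraper that solves the challenge, but it removes the direct URL that makes the challenge pointless.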
One of the reasons why this irks me is that the scrapers are all commercial
sites, but they haven't offered me a piece of the action for the use of my
files. My domain is an entirely non-commercial domain, and I provide free
hosting for other non-commercial genealogical works, primarily pages that are
part of the USGenWeb Project, which is perhaps the largest of all
non-commercial genealogical projects.
Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net
PHP General Mailing List (http://www.php.net/)