At 05:04 PM 3/13/2013, Dan McCullough wrote:
>Web bots can ignore the robots.txt file, most scrapers would.
and at 05:06 PM 3/13/2013, Marc Guay wrote:
>These don't sound like robots that would respect a txt file to me.

Dan and Marc are correct. Although I used the terms "spiders" and "pirates," I believe that the correct term, as employed by Dan, is "scrapers," and that term might be applied either to the robot or to the site that displays its results. One blogger has called scrapers "the arterial plaque of the Internet."

I need to implement a solution that allows humans to access my files but prevents scrapers from accessing them. I will undoubtedly have to implement some type of challenge-and-response in the system (such as a captcha), but as long as those files are stored below the web root, a scraper that has a valid URL can probably grab them. That is part of what the "public" in public_html implies.

One of the reasons this irks me is that the scrapers are all commercial sites, yet they have not offered me a piece of the action for the use of my files. My domain is entirely non-commercial, and I provide free hosting for other non-commercial genealogical works, primarily pages that are part of the USGenWeb Project, which is perhaps the largest of all non-commercial genealogical projects.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
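The usual fix for the "public_html means public" problem described above is to move the files *outside* the web root entirely and serve them through a PHP gatekeeper script, so no direct URL ever reaches them. A minimal sketch follows — the directory path and the `$_SESSION['human']` flag (presumably set by whatever captcha check is put in front of this) are hypothetical names, not anything from the original post:

```php
<?php
// download.php - minimal gatekeeper sketch.
// Assumes a prior captcha step has set $_SESSION['human'] = true.
session_start();

// Files live OUTSIDE the web root (hypothetical path), so scrapers
// cannot fetch them by URL even if they harvest the link.
$baseDir = '/home/example/private_files';

if (empty($_SESSION['human'])) {
    http_response_code(403);
    exit('Please complete the captcha first.');
}

// basename() strips any directory components, blocking traversal
// attempts such as ?file=../../etc/passwd.
$name = isset($_GET['file']) ? basename($_GET['file']) : '';
$path = $baseDir . '/' . $name;

if ($name === '' || !is_file($path)) {
    http_response_code(404);
    exit('File not found.');
}

header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $name . '"');
header('Content-Length: ' . filesize($path));
readfile($path);
```

This doesn't stop a determined scraper that solves or outsources the captcha, but it does close the hole where a harvested URL works forever: the session flag, not the URL, is what grants access.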