On Mar 13, 2013 7:06 PM, "David Robley" <robl...@aapt.net.au> wrote:
>
> "Dale H. Cook" wrote:
>
> > At 05:04 PM 3/13/2013, Dan McCullough wrote:
> >
> >>Web bots can ignore the robots.txt file, most scrapers would.
> >
> > and at 05:06 PM 3/13/2013, Marc Guay wrote:
> >
> >>These don't sound like robots that would respect a txt file to me.
> >
> > Dan and Marc are correct. Although I used the terms "spiders" and
> > "pirates" I believe that the correct term, as employed by Dan, is
> > "scrapers," and that term might be applied to either the robot or the
> > site which displays its results. One blogger has called scrapers "the
> > arterial plaque of the Internet." I need to implement a solution that
> > allows humans to access my files but prevents scrapers from accessing
> > them. I will undoubtedly have to implement some type of
> > challenge-and-response in the system (such as a captcha), but as long as
> > those files are stored below the web root a scraper that has a valid URL
> > can probably grab them. That is part of what the "public" in public_html
> > implies.
> >
> > One of the reasons why this irks me is that the scrapers are all
> > commercial sites, but they haven't offered me a piece of the action for
> > the use of my files. My domain is an entirely non-commercial domain, and I
> > provide free hosting for other non-commercial genealogical works,
> > primarily pages that are part of the USGenWeb Project, which is perhaps
> > the largest of all non-commercial genealogical projects.
> >
>
> readfile() is probably where you want to start, in conjunction with a
> captcha or similar
>
> --
> Cheers
> David Robley
>
> Catholic (n.) A cat with a drinking problem.
>
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>

If the files are delivered over the web, whether by PHP or some other means, they can still be scraped, even if they are stored outside the web root.
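
For what it's worth, here is a rough sketch of the readfile()-plus-challenge approach David mentions. Everything named here is an assumption for illustration: the PROTECTED_DIR path, the `file` query parameter, the `captcha_passed` session flag (which some separate captcha page would have to set), and both helper functions. It does not make the files unscrapeable; it only means a bare URL is no longer enough, because each download has to come from a session that has passed the challenge.

```php
<?php
// Hypothetical gate script (say, download.php). All names and paths below
// are assumptions for illustration, not anything from the original thread.

// Assumed directory outside the web root that holds the protected files.
const PROTECTED_DIR = '/home/example/protected_files';

/**
 * Resolve a requested file name to a real path inside $baseDir.
 * Returns null if the file is missing or resolves outside $baseDir.
 */
function resolve_protected_file(string $baseDir, string $requested): ?string
{
    // basename() discards directory components such as "../".
    $path = realpath($baseDir . DIRECTORY_SEPARATOR . basename($requested));
    $base = realpath($baseDir);
    if ($path === false || $base === false || !is_file($path)) {
        return null;
    }
    // Guard against symlinks that point outside the base directory.
    $prefix = $base . DIRECTORY_SEPARATOR;
    return strncmp($path, $prefix, strlen($prefix)) === 0 ? $path : null;
}

/** True if the session carries the (assumed) flag set after a passed captcha. */
function visitor_is_verified(array $session): bool
{
    return !empty($session['captcha_passed']);
}

// Handle a request only when running under a web server, not the CLI.
if (PHP_SAPI !== 'cli') {
    session_start();
    if (!visitor_is_verified($_SESSION)) {
        http_response_code(403);
        exit('Please complete the challenge first.');
    }
    $name = $_GET['file'] ?? '';
    $file = resolve_protected_file(PROTECTED_DIR, is_string($name) ? $name : '');
    if ($file === null) {
        http_response_code(404);
        exit('File not found.');
    }
    header('Content-Type: application/octet-stream');
    header('Content-Length: ' . filesize($file));
    header('Content-Disposition: attachment');
    readfile($file); // stream the file from outside the web root
}
```

A scraper that drives a real browser session and solves the captcha would still get through, so this raises the cost of scraping rather than preventing it.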