On Thu, Dec 01, 2011 at 11:10:44AM -0500, Whalen, Liam wrote:
> I thought robots.txt was meant to be checked by web crawlers?

D'oh! Yes, of course, robots.txt is irrelevant. Sorry, I don't know
what I was thinking.

> > and you really should randomize the order in which you check links
> > so that you don't end up unknowingly burying a server in a flurry
> > of requests.
>
> This is a good point of which I hadn't thought. Doing this would add
> a fair bit of size to the database because the URLs to be checked
> would need to be stored in the database.

[...]

> > (This last point is where I've been burned in the past.) Also, I
> > assume (blithely!) that there are already plenty of good link
> > checkers out there -- you could even use something as simple as
> > curl or wget with the proper options.
>
> I'm using Perl's HTTP::Request package to do the link checking. It
> would be fairly straightforward to separate the searching of the
> URLs from the checking. I could add a new method to LinkChecker.pm
> that would harvest the URLs, then modify the current checking code
> to loop over that data. I'm not sure if having the list of URLs
> stored permanently in a separate table is entirely worthwhile,
> though. The data already exists in the MARC records. Storing it
> again is duplication. Storing it to sort it is another matter.
> Perhaps the best option would be to store it, sort it, check the
> URLs, then delete the data.

Could you ORDER BY RAND() or some such? That way you wouldn't need to
store the URLs at all (unless I'm missing something).
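Off the top of my head, it would look something like the sketch below
-- completely untested, assuming MySQL (RAND() is MySQL's spelling;
SQLite and PostgreSQL call it RANDOM()), and with a made-up "urls"
table and connection details standing in for wherever the URLs
actually live:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use LWP::UserAgent;   # wraps HTTP::Request, so same machinery

    # Connection details and table/column names are all made up --
    # point this at wherever the URLs really are.
    my $dbh = DBI->connect('dbi:mysql:database=links', 'user', 'password',
                           { RaiseError => 1 });

    my $ua = LWP::UserAgent->new(timeout => 30);

    # Let the database do the shuffling, so that consecutive requests
    # (almost certainly) hit different hosts.
    my $sth = $dbh->prepare('SELECT url FROM urls ORDER BY RAND()');
    $sth->execute;

    while (my ($url) = $sth->fetchrow_array) {
        my $response = $ua->head($url);   # HEAD is cheaper than GET
        printf "%d %s\n", $response->code, $url;
        sleep 1;   # still polite, even with the order shuffled
    }

    $dbh->disconnect;

And if the URLs can be queried straight out of whatever table the
MARC data already lands in, you get the random order for free without
ever storing them twice.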
> That wasn't harsh at all. Thanks for your input!

Whew! You're very welcome.

Paul.

-- 
Paul Hoffman <[email protected]>
Systems Librarian
Fenway Libraries Online c/o Wentworth Institute of Technology
550 Huntington Ave.
Boston, MA 02115
(617) 445-2914
(617) 442-2384 (FLO main number)
