According to William R. Knox:

> In order to work around issues where htdig will drop entries for files
> that have a noindex,follow robot exclusion, I create a file that I feed
> into the start_url parameter in the htdig.conf file. You might try this
> instead.
>
>   ls -d /path/to/files/* | sed -e \
>     's/\/path\/to\/files\//http:\/\/my.address.mitre.org\//' >> \
>     /path/to/file_list
>
> and, in htdig.conf:
>
>   start_url: `/path/to/file_list`
>
> Does this, by the way, have a limit to how long the list can be, i.e. can
> I have thousands of entries in the file that is fed into start_url?
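[For reference, an equivalent way to build the same list, using a different
sed delimiter so the slashes need no escaping. The paths and hostname are the
same placeholders as in the quoted command; substitute your own document root
and site address.]

    # Turn each file under the document root into a start URL.
    ls -d /path/to/files/* | \
        sed -e 's|^/path/to/files/|http://my.address.mitre.org/|' >> \
        /path/to/file_list

Each line of /path/to/file_list then becomes one start URL when htdig reads
the file through the backquoted filename in the start_url attribute, as shown
in the quoted htdig.conf line above.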
My understanding is that this is limited only by the amount of virtual memory
available to htdig. Preloading all the URLs into the queue at the start, as
this does, will consume more VM than letting htdig spider down a tree of
links, but when you run htdig on an existing database to update it, it
preloads the URL list from the db into the queue anyway, so this approach is
no worse.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
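[A rough sanity check before a large run, sketched with standard shell tools
rather than anything htdig-specific; the process name "htdig" and the ps
options are assumptions based on a Linux procps ps.]

    # How many start URLs will be preloaded into the queue?
    wc -l /path/to/file_list

    # Watch htdig's resident and virtual memory while it digs:
    ps -C htdig -o rss,vsz,comm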

