According to William R. Knox:
> To work around issues where htdig drops entries for files that have a
> noindex,follow robot exclusion, I create a file that I feed into the
> start_url parameter in the htdig.conf file. You might try this
> instead.
> 
> ls -d /path/to/web/files/* | sed -e \
> 's|^/path/to/web/files/|http://my.address.mitre.org/|' >> \
> /path/to/file_list
> 
> and, in htdig.conf
> 
> start_url:    `/path/to/file_list`
> 
> Is there, by the way, a limit on how long the list can be? That is, can
> I have thousands of entries in the file that is fed into start_url?

My understanding is that this is limited only by the amount of virtual
memory available to htdig.  Preloading all the URLs into the queue at
the start like this will consume more VM than letting htdig spider
down a tree of links.  However, when you run htdig on an existing
database to update it, it preloads the URL list from the database into
the queue anyway, so this approach is no worse.
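
Incidentally, if the files are spread across subdirectories, a recursive
variant of the same trick should work.  This is just a sketch reusing the
placeholder path and host name from the quoted message, so substitute
your own:

  # Walk the document tree, keep regular files only, and rewrite each
  # filesystem path into the corresponding URL before appending it to
  # the list that start_url reads.
  find /path/to/web/files -type f | sed -e \
  's|^/path/to/web/files/|http://my.address.mitre.org/|' >> \
  /path/to/file_list

Either way, the generated file is just one URL per line, which start_url
should accept, since its value is treated as a whitespace-separated list
of URLs.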

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
