On Mon, Mar 11, 2002 at 07:42:22AM -0800, David A. Desrosiers wrote:
> So we're back to a few ideas/solutions. I'm still clamoring for a
> --stayondomain argument, which will go from the "dot" domain back up (i.e.
> ARPA specification, like .org.slashdot) and stop there, so *ANYTHING* that
> is on that domain itself (not the hostname, the _DOMAIN_) will be included,
> this means images.slashdot.org, articles.slashdot.org, etc.

I very much like that idea. What I've had to do to emulate it is create a
slashdot.txt file with the contents

    0:-:.*
    1:+:.+slashdot.org/palm.*
    2:-:.*comments.*shtml

then call plucker-build with -E slashdot.txt. Basically, for every site I
pluck from a cronjob, I have a similar setup. It's a bit of a pain to go
and create a new one when I want to add a new site.

> The other idea is coupled to that, and allows a maximum of links to
> be gathered before it stops. --maxlinks=200 for example, would stop the
> parse, roll up the existing data at that point, and pack it into the pdb.
> This could be coupled with a --breadth-first --depth-first option pair,
> depending on your needs (I brought this exact pair of options up about 2
> years ago as well).

Recently, PalmInfoCenter had a problem in their /palm/ directory: it had a
link to their main full site, but within the /palm/ directory. As a result,
plucker-build was making me a 1000+ link .pdb file of every single story on
their site, full graphics and everything. It would always be nice to have a
value to prevent any runaway links.

-- 
Adam McDaniel
Array Networks
Calgary, AB, Canada
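
For illustration only, the ideas discussed in this message (the
priority:action:regex exclusion rules, a stay-on-domain check, and a link
cap on a breadth-first walk) could be sketched in Python roughly as follows.
This is not Plucker's code. The function names, the get_links callback, the
rule semantics (highest-priority matching rule wins, unmatched URLs are
included, patterns anchored at the start of the URL), and the behavior of
the proposed --stayondomain and --maxlinks options are all assumptions.

```python
import re
from collections import deque
from urllib.parse import urlparse


def same_domain(url, domain):
    """True if the URL's host is `domain` itself or any subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def allowed(url, rules):
    """Apply (priority, action, regex) rules to a URL.

    Assumptions: the highest-priority matching rule decides, ties go to
    the later rule, and a URL that matches no rule is included.
    """
    verdict, best = True, None
    for prio, action, pattern in rules:
        if re.match(pattern, url) and (best is None or prio >= best):
            best, verdict = prio, action == "+"
    return verdict


def crawl(start, get_links, domain, max_links=200):
    """Breadth-first walk that stays on `domain` and stops at max_links.

    `get_links` is a hypothetical caller-supplied function that returns
    the list of URLs linked from a page.
    """
    seen = {start}
    queue = deque([start])
    collected = []
    while queue and len(collected) < max_links:
        url = queue.popleft()
        collected.append(url)
        for link in get_links(url):
            if link not in seen and same_domain(link, domain):
                seen.add(link)
                queue.append(link)
    return collected


# The three rules from slashdot.txt above, as (priority, action, regex).
SLASHDOT_RULES = [
    (0, "-", r".*"),
    (1, "+", r".+slashdot.org/palm.*"),
    (2, "-", r".*comments.*shtml"),
]
```

With a cap like max_links, a stray link back to the full site (as in the
PalmInfoCenter case) would at worst fill the quota rather than pull in every
story. The `seen` set also keeps pages that link to each other from being
fetched twice.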
