I'm whacking away at the new my.plkr.org site and noticed yet
another gotcha with my Slashdot litmus test. I'm using the latest parser
from HEAD, which is nice, but still leaves something to be desired with
respect to the depth of crawling *ONLY ONE DOMAIN*. Here's an example:

    plucker-build -V 1 -H "http://slashdot.org/palm/" -f /tmp/Slashdot \
        --staybelow="http://slashdot.org/palm" \
        --zlib-compression --maxdepth=3 --bpp=4 \
        --url-pattern='http://.*slashdot\.org/.*'

This gives "most" of the desired results, but once again we have
conflicting options. I'd like to get slashdot.org/palm and everything
below it to a depth of, say, 100. With the syntax above, --staybelow
conflicts with the url pattern (yes, I haven't tweaked the pattern yet,
or played with exclusionlist.txt at this point). Removing --staybelow
lets me gather the articles linked at the bottom of each story ("Top 10
replies"), but it also pulls in the 600 offsite links to osdn.com,
newsforge, etc. Keeping --staybelow prohibits gathering of images, even
when url-pattern is specified (i.e. staybelow overrides url-pattern).

So we're back to a few ideas/solutions. I'm still clamoring for a
--stayondomain argument, which would work from the "dot" domain back up
(i.e. the ARPA specification, like .org.slashdot) and stop there, so
*ANYTHING* on that domain itself (not the hostname, the _DOMAIN_) would
be included: images.slashdot.org, articles.slashdot.org, etc.

The other idea is coupled to that, and would allow a maximum number of
links to be gathered before the parse stops. --maxlinks=200, for
example, would stop the parse, roll up the existing data at that point,
and pack it into the pdb. This could be coupled with a
--breadth-first/--depth-first option pair, depending on your needs (I
brought this exact pair of options up about 2 years ago as well).
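To make the --stayondomain idea concrete, here's a minimal sketch of the check it would perform; the function name and signature are my own invention, not anything in plucker-build. The point is that it matches on the registered domain rather than the full hostname, so every subdomain passes:

```python
from urllib.parse import urlparse

def stays_on_domain(url, domain):
    """True if url's host is `domain` itself or any subdomain of it.

    Unlike a --staybelow prefix match, this admits images.slashdot.org,
    articles.slashdot.org, etc., while still rejecting offsite hosts.
    """
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

# Subdomains of slashdot.org are in; osdn.com and look-alike
# domains (notslashdot.org) are out.
print(stays_on_domain("http://images.slashdot.org/pic.gif", "slashdot.org"))
print(stays_on_domain("http://www.osdn.com/", "slashdot.org"))
```

The suffix check includes the leading dot deliberately, so "notslashdot.org" doesn't sneak in as a false subdomain match.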
So right now, I'm stuck either including the linked articles in the Slashdot plucked database (--maxdepth=3, which also drags in the 600 offsite links: a perfect case for the --stayondomain argument), or using --maxdepth=2, missing the linked articles, and being forced to use --staybelow so that I don't stray from the originating domain. I'll put some more neurons on this later and see if I can come up with a more workable combination of url-pattern and exclusionlist.txt to get the results I want. [dd]
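For the --maxlinks/--breadth-first pair, the behavior I have in mind could be sketched like this; again, the names here (crawl_breadth_first, get_links) are hypothetical, not part of the current parser:

```python
from collections import deque

def crawl_breadth_first(seed, get_links, max_links=200):
    """Gather pages breadth-first, stopping once max_links are collected.

    get_links(url) is an assumed helper returning the URLs linked from a
    page. Whatever has been gathered when the cap is hit would then be
    rolled up and packed into the pdb, per the --maxlinks idea above.
    """
    seen = {seed}
    queue = deque([seed])
    collected = []
    while queue and len(collected) < max_links:
        url = queue.popleft()
        collected.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected
```

Breadth-first matters here: with a cap of 200 links, you'd get the front page and all its stories before any "Top 10 replies" articles, rather than burrowing depth-first down one story's reply chain until the budget runs out.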
