I'm whacking away at the new my.plkr.org site and noticed yet
another gotcha with my Slashdot litmus test. I'm using the latest parser
from HEAD, which is nice, but still leaves something to be desired with
respect to the depth of crawling *ONLY ONE DOMAIN*. Here's an example:

    plucker-build -V 1 -H "http://slashdot.org/palm/" -f /tmp/Slashdot \
        --staybelow="http://slashdot.org/palm" \
        --zlib-compression --maxdepth=3 --bpp=4 \
        --url-pattern='http://.*slashdot\.org/.*'

This gives "most" of the desired results, but once again we have
conflicting options. I'd like to get slashdot.org/palm and everything
below it to a depth of, say, 100. With the syntax above, --staybelow
conflicts with the url pattern (yes, I haven't tweaked the pattern yet,
or played with exclusionlist.txt at this point). Removing --staybelow
lets me gather the articles linked at the bottom of each story ("Top 10
replies"), but it also pulls in the 600 offsite links to osdn.com,
newsforge, etc. Keeping --staybelow prohibits gathering of images, even
when url-pattern is specified (i.e. staybelow overrides url-pattern).

So we're back to a few ideas/solutions. I'm still clamoring for a
--stayondomain argument, which would work from the "dot" domain back up
(i.e. the ARPA specification, like .org.slashdot) and stop there, so
*ANYTHING* on that domain itself (not the hostname, the _DOMAIN_) would
be included: images.slashdot.org, articles.slashdot.org, etc.

The other idea is coupled to that, and would allow a maximum number of
links to be gathered before the parse stops. --maxlinks=200, for
example, would stop the parse, roll up the existing data at that point,
and pack it into the pdb. This could be coupled with a
--breadth-first/--depth-first option pair, depending on your needs (I
brought this exact pair of options up about 2 years ago as well).
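To make the --stayondomain idea concrete, here's a minimal sketch of the check it would perform; the function name and signature are my own invention, not anything in plucker-build. The point is that it matches on the registered domain rather than the full hostname, so every subdomain passes:

```python
from urllib.parse import urlparse

def stays_on_domain(url, domain):
    """True if url's host is `domain` itself or any subdomain of it.

    Unlike a --staybelow prefix match, this admits images.slashdot.org,
    articles.slashdot.org, etc., while still rejecting offsite hosts.
    """
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

# Subdomains of slashdot.org are in; osdn.com and look-alike
# domains (notslashdot.org) are out.
print(stays_on_domain("http://images.slashdot.org/pic.gif", "slashdot.org"))
print(stays_on_domain("http://www.osdn.com/", "slashdot.org"))
```

The suffix check includes the leading dot deliberately, so "notslashdot.org" doesn't sneak in as a false subdomain match.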
So right now, I'm stuck either including the linked articles in the Slashdot plucked database (--maxdepth=3, which also drags in the 600 offsite links: a perfect case for the --stayondomain argument), or using --maxdepth=2, missing the linked articles, and being forced to use --staybelow so that I don't stray from the originating domain. I'll put some more neurons on this later and see if I can come up with a more workable combination of url-pattern and exclusionlist.txt to get the results I want. [dd]
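For the --maxlinks/--breadth-first pair, the behavior I have in mind could be sketched like this; again, the names here (crawl_breadth_first, get_links) are hypothetical, not part of the current parser:

```python
from collections import deque

def crawl_breadth_first(seed, get_links, max_links=200):
    """Gather pages breadth-first, stopping once max_links are collected.

    get_links(url) is an assumed helper returning the URLs linked from a
    page. Whatever has been gathered when the cap is hit would then be
    rolled up and packed into the pdb, per the --maxlinks idea above.
    """
    seen = {seed}
    queue = deque([seed])
    collected = []
    while queue and len(collected) < max_links:
        url = queue.popleft()
        collected.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected
```

Breadth-first matters here: with a cap of 200 links, you'd get the front page and all its stories before any "Top 10 replies" articles, rather than burrowing depth-first down one story's reply chain until the budget runs out.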
