> > <img src="//images.slashdot.org/palm/title_palm.gif">
>
> What's "bad" with that relative url? It will be resolved as
> "http://images.slashdot.org/palm/title_palm.gif".

        Browsers will resolve it properly, because of the way they strip
the slash/protocol when comparing relative vs. base urls. Look again at the
relative url there. There are two leading slashes, and this seems to confuse
the 5 spiders I threw at it. I was able to get around it with
HTML::LinkExtor in perl (of course), but it seems that other spiders do not
have that type of "business logic" built into them, including Plucker. I
have been plucking that same site hourly for about a week and depositing the
resulting PDB on the server, and the run always comes back in my email with
a 404 or 500 error, which I simply ignore.
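
        To make the two-slash case concrete, here is a rough sketch in
Python (not what Plucker or any of the spiders actually does). It compares
the standard library's urljoin against a hypothetical naive_resolve helper,
which is only my guess at how a spider might mishandle the leading slashes.
The base url and img src are the ones from this thread; the function names
are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit

BASE = "http://slashdot.org/palm/"
SRC = "//images.slashdot.org/palm/title_palm.gif"


def resolve(base, ref):
    """Resolve ref against base using the standard library's rules.

    A reference starting with '//' keeps the base's scheme but replaces
    the host and path.
    """
    return urljoin(base, ref)


def naive_resolve(base, ref):
    """Hypothetical buggy resolver (an assumption, for illustration only).

    It treats anything starting with '/' as a path on the base host, so
    the '//host/path' form ends up glued onto the wrong server.
    """
    parts = urlsplit(base)
    if ref.startswith("/"):
        return f"{parts.scheme}://{parts.netloc}{ref}"
    return base + ref


if __name__ == "__main__":
    print("urljoin:", resolve(BASE, SRC))
    print("naive:  ", naive_resolve(BASE, SRC))
```

        If a spider does something like the naive version, it would ask
slashdot.org for a path that does not exist there, which would be
consistent with the 404s I keep getting.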

> The problem Chris is seeing must be some local problem, because I could
> run "plucker-build -v -H http://slashdot.org/palm/ -f Slashdot" just
> fine.

        I just ran it with the base arguments, no modifiers (maxdepth) and
it seems to work. Perhaps they fixed it on the server side? In any case,
there's another URL you can use, or you could simply snarf the following:

        http://www.plkr.org/samples/plucks/Slashdot.pdb (updated hourly)

        This one comes from AvantSlash, local code on the server, which
digs deeper into the comments sections of the site. Personal preference, I
suppose. The Wall Street Journal is there also.

        Also, here's another thing I've been playing with, which is all
driven from external XML and RSS/RDF "backend" content delivery and
syndication outputs:

        http://gnu-designs.com/code/rdf/

        It's all alphabetized. Do not bookmark it; it will certainly be
moved or changed soon. It's simply some new ideas I'm playing with.



/d

