Re: Plucker for Linux

David A. Desrosiers Mon, 14 Feb 2005 14:39:39 -0800

Agreed, and yet I barely have time (actually, I -don't- have time) to pay adequate attention to the FLOSS projects I've already more or less committed my time to.

Unfortunately, many of us also have little or no time to spend on these kinds of issues, so we rely on users to do as much groundwork as they can to help us fix things as fast as possible.

So let's try to do that... what exactly is the problem you've seen, and what exactly is the kind of result you'd expect? From what I read, you basically want to exclude the building of the viewer components, if not explicitly specified, yes?

That's about what I expected then, except I was expecting to only do an RSS feed...

Why not just use the links that provide those RSS feeds instead?

        http://plkr.org/rss.pl
        http://plkr.org/rdf.pl

Should adding -c to my plucker-build help?


    I'm not even sure that works anymore.. lemme give it a test:

        [time passes]

...well, it writes the cache, but doesn't appear to check the upstream site's Last-Modified header (if present), so it just refetches the content over and over.

(How do you check if something has changed, without first downloading it? Is there some sort of timestamping and/or message digesting going on that I'm not familiar with?)

Generally, you issue a HEAD request for the resource and check the Last-Modified date (if present), fetching the resource only if that date is more recent than your local copy. Most current web servers support HEAD, but not all of them send a Last-Modified header, and many dynamic pages present a new Last-Modified date on every request even when the content they serve hasn't changed, which forces a re-fetch. There's a trick for handling that as well: compare the Content-Length of the resource. Both checks require that you record these values locally at fetch time, in some sort of local dbm, cache, or whatever.
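To make that concrete, here is a rough sketch in Python of the HEAD-then-compare idea. It is not what plucker-build does; the function names and the shape of the `cached` dict (the values you would have stored at fetch time) are made up for illustration. Note that treating an equal Content-Length as "unchanged" is only a heuristic and will miss edits that happen to keep the size the same.

```python
from email.utils import mktime_tz, parsedate_tz
from urllib.request import Request, urlopen


def head_headers(url, timeout=10):
    """Issue a HEAD request and return the response headers (lower-cased keys)."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return {k.lower(): v for k, v in resp.headers.items()}


def _http_date_to_epoch(value):
    """Parse an HTTP date string into epoch seconds, or None if unusable."""
    parsed = parsedate_tz(value) if value else None
    return mktime_tz(parsed) if parsed else None


def _to_int(value):
    """Convert a header value to int, or None if missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def needs_refetch(cached, headers):
    """Decide whether to download again.

    cached  -- hypothetical record kept at fetch time, e.g.
               {"last_modified": "<HTTP date>", "content_length": 1234},
               or None if the resource was never fetched.
    headers -- lower-cased headers from a HEAD request.
    """
    if cached is None:
        return True

    remote_lm = _http_date_to_epoch(headers.get("last-modified"))
    local_lm = _http_date_to_epoch(cached.get("last_modified"))
    remote_len = _to_int(headers.get("content-length"))
    local_len = _to_int(cached.get("content_length"))

    # Server copy is not newer than ours: nothing to do.
    if remote_lm is not None and local_lm is not None and remote_lm <= local_lm:
        return False

    # Last-Modified is missing or newer (dynamic pages often bump it on
    # every request), so fall back to comparing Content-Length.
    same_length = remote_len is not None and remote_len == local_len
    return not same_length
```

In use, you would call `needs_refetch(cached, head_headers(url))` before each fetch and update the stored record whenever you do download.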

OK, no problem. I've reduced the frequency to once every two weeks, instead of once per day. The site doesn't seem to change much anyway.

Well, what is the purpose? To find new News articles? Or to find new pages on the site? Why turn the whole site into a Plucker document with your Plucker spider a few times a day? I can easily just create a new .pdb of the site when it changes, and you can just fetch that daily, hourly, or whatever... or just use the RSS/RDF feeds.

Thanks. BTW, does plucker-build respect robots.txt?

The Python, Java, and C++ versions of the Plucker distillers presently do not.
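For anyone who wants to check by hand before spidering a site, a minimal sketch using Python's standard `urllib.robotparser` might look like the following. This is not part of any Plucker distiller; the `allowed` helper and the "Plucker" user-agent string are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url.

    robots_txt is the already-downloaded text of the site's robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits to all agents.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "Plucker", "http://example.com/index.html"))
print(allowed(rules, "Plucker", "http://example.com/private/notes.html"))
```

Passing the text in, rather than letting the parser fetch it, keeps the check easy to run offline against a saved copy of robots.txt.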


David A. Desrosiers
[EMAIL PROTECTED]
http://gnu-designs.com