David A. Desrosiers wrote:
>> I was thinking of maintaining a simple "registry" (just an XML file
>> on a web server) that specifies sites/servers where you have to be
>> extra careful.
>
> Are you suggesting that the parser/distiller fetch this file at
> every Pluck? I would consider that a bit of an issue, and I know I
> personally wouldn't use that feature, as would a lot of other people.
> It's akin to launching an app and having it send out a little UDP
> packet to let the server or author know you're using his software. No
> thanks.

No, the registry is just a local copy. The user can check for a refresh
either automatically (say, once a week) or manually, much like a
"check for software update" feature.

>> This way JPluck (or any other parser) can automatically lower the
>> connection count if it is set too high by the user, or introduce a
>> delay.
>
> Ick. If I set or override a parameter, I EXPECT it to stay
> overridden, and not reverted by some tool I can't control upstream.

Use of the registry is of course optional. It is recommended that you
use it to avoid problems later, like being banned by Slashdot. The
registry could also specify a user agent, so if a site requires a
particular user agent, JPluck will send that one. This way you don't
have to find it out yourself through trial and error.

>> If the server sends a 90k 404 not found page that is their problem.
>
> ..and a problem for the client. If you hit a page with 100 links,
> and 90 of them are 404's, and you spider into them, that is
> potentially 90k times 90 links. That's 8.2 megs of _useless_ data
> fetched from the server to the client. A huge waste.

If a site is this badly configured, it will probably not be worth
downloading. The client can kill the connection at any time. Also,
images from 404 pages are not retrieved by the JPluck spider, since the
entire page is discarded in the first place (and no image links are
extracted from it). At worst, only the 404 HTML itself is downloaded.
I don't think that 90k 404 pages are that common. Your example is a bit
far-fetched.

>> Anyway, browsers perform requests this way (with an
>> if-modified-since) as well, so I don't see an issue with this.
>
> Completely useless on dynamic sites, as I've mentioned before. If
> the server is serving up dynamic pages, it has no way of checking
> what YOUR content-modification date is. This also increases bandwidth
> as well, though to a lesser degree.

An if-modified-since request is not necessarily useless on dynamic
sites.
With if-modified-since the client specifies the date when it last
fetched this particular URL, and that is the client's modification
date. Most web applications don't respond to it because it is difficult
to track when the back-end has changed. They just do SELECT * FROM
table and dump the contents. Web application literature rarely
addresses this issue.

> No, do _NOT_ default it to AvantGo, because as I said, Plucker is
> not AvantGo. Besides, we might be getting ourselves into trouble if
> we say we're them in the UserAgent string by default, or ship a tool
> that does so. Make it an option to enable "AvantGo Heuristics" as
> I've done in the email parser.

OK. The default will stay JPluck.

Regards
-Laurens
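
To illustrate the if-modified-since point above, here is a minimal
sketch (in Python) of how a dynamic page could honor the header,
assuming the application keeps track of when its back-end data last
changed. The names conditional_response, last_changed and render are
hypothetical and not part of JPluck or Plucker. Real servers differ in
the details.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, last_changed, render):
    """Return (status, headers, body) for a dynamically generated page.

    if_modified_since: raw header value sent by the client, or None.
    last_changed: timezone-aware datetime of the last back-end change
                  (assumption: the application tracks this itself).
    render: zero-argument callable that builds the page body.
    """
    # HTTP dates are in GMT with one-second resolution.
    last_changed = last_changed.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_changed, usegmt=True)}

    if if_modified_since:
        try:
            client_date = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            client_date = None  # unparsable header: send the full page
        if client_date is not None:
            if client_date.tzinfo is None:
                client_date = client_date.replace(tzinfo=timezone.utc)
            if last_changed <= client_date:
                # Nothing changed since the client's copy: no body needed.
                return 304, headers, b""

    return 200, headers, render()
```

In this sketch the page is only rendered and transferred when the data
is newer than the client's copy. Finding out last_changed still costs
the server a lookup, so the saving is mostly in bandwidth.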

