hmmm. good point. reminds me of college days and "the most effective optimisation is not to waste time optimising".
the problems are classic ones. for a given site, it would be a difficult task without reworking the entire "plucking" approach to use multi-threaded spidering. what we could do is, for a given job - we'll use plucker desktop as an example - start independent channels in their own threads. another example is the daily news variety of job - we just set a thread going on each "channel". i'd reckon these two cases are reasonably easy to detect.

when there's only one thing in progress, like generating an ebook, the delay isn't such a big deal. but when you have 10-20 daily news channels, amounting to some 500 pages, it's likely to take a lot less time if not run serially. plucking a couple of different content sources at the same time should take less time overall, and using this approach we also hammer each server less, which should lead to fewer errors.

-kev

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of David A. Desrosiers
Sent: 23 September 2002 12:43
To: Plucker Development List
Subject: RE: java flavoured cookies

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> excellent! since it's pure java, it opens the door to cookies too, yes?
> how about multithreading -- essentially speeding up the process..?

There's an interesting problem with "speeding up" Plucker... well, several problems, but I'll stay at the 40,000-foot level for now.

As you add multiple "threads" to fetch content, you run the risk of fetching duplicate instances of content. An example is Slashdot's main page: it links to Freshmeat.net. If you also try to gather Freshmeat's webpage, it links back to Slashdot. With two threads going, each fetching one of those webpages, you risk pulling duplicate content, which wastes both time and bandwidth. To do this properly, you either have to share memory between the threads, which isn't fun from a programming perspective, or write your fetched data to a file and provide locking mechanisms.
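The duplicate-fetch problem above can be handled with a shared "seen" set guarded by a lock. This is a minimal sketch, not Plucker's actual code - the channel URL lists, class name, and worker count are all made up for illustration, and a list append stands in for the real HTTP fetch:

```python
import queue
import threading

class Deduper:
    """Shared set of already-claimed URLs, guarded by a lock so
    multiple fetch threads never pull the same page twice."""
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, url):
        """Return True only for the first thread to claim this url."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

def worker(dedup, jobs, fetched):
    while True:
        try:
            url = jobs.get_nowait()
        except queue.Empty:
            return
        if dedup.claim(url):
            fetched.append(url)   # stand-in for the real HTTP fetch

# two "channels" whose link lists overlap, like Slashdot <-> Freshmeat
channel_a = ["http://slashdot.org/", "http://freshmeat.net/"]
channel_b = ["http://freshmeat.net/", "http://slashdot.org/"]

jobs = queue.Queue()
for u in channel_a + channel_b:
    jobs.put(u)

dedup, fetched = Deduper(), []
threads = [threading.Thread(target=worker, args=(dedup, jobs, fetched))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(fetched))   # each URL fetched exactly once
```

The lock is the "share the memory between the threaded processes" cost David mentions: every claim serialises briefly, but the expensive network fetch still runs in parallel.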
Ideally, you have one thread testing the specified links (using HEAD, not GET), up to maxdepth, for validity (are they up? down? inaccessible? blocked?) and returning a list of valid urls to a second thread, which sorts out the duplicates and hands the valid links off to a third "gather" thread, which fetches the content itself. There's a slight problem here, though: when you gather, you have to traverse, and now the first thread, the one needed to test validity, becomes part of the spidering thread. I've got a design that works very well here, but it doesn't lend itself to threading, though it does the same/similar things in a much more rapid way.

The other thing, more hurtful to Plucker as a successful project, is that users always want it to act faster, and Plucker may not always be the limiting factor. Some websites may shut us off for being too "fast" at fetching their content. We may also run into a robots.txt that prohibits fetching more than one page within a certain period of time. Right now we ignore robots.txt, but it's my belief that we shouldn't. It's abusive not to, and we're only going to piss off content owners by not adhering to the standards which recommend using robots.txt.

So therein lies the conflict: users want the content as fast as possible, while content providers prohibit them from slamming their servers to get it that fast. So what do you do?

d.

perldoc -qa.j | perl -lpe '($_)=m("(.*)")'

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.1.92 (GNU/Linux)

iD8DBQE9jvBKkRQERnB1rkoRAhopAKCjnflMvWCMloIAVd5SkNfLQm4XSACg31GM
liW/J2UV585gXw1Qr/zCnu0=
=jTsX
-----END PGP SIGNATURE-----

_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
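The staged design David describes (HEAD-validate, dedupe, gather) maps naturally onto queues between threads. A toy sketch, not Plucker's code: a made-up `head_ok()` rule ("anything under /dead/ is down") stands in for the real HEAD request, a list append stands in for the real GET, and the dedupe and gather stages are collapsed into one thread for brevity:

```python
import queue
import threading

def head_ok(url):
    # Stub for the HEAD check; the real tool would issue an
    # HTTP HEAD request and report whether the URL is reachable.
    return "/dead/" not in url

def validator(candidates, valid):
    """Stage 1: test each candidate link, pass live ones downstream."""
    while True:
        url = candidates.get()
        if url is None:              # sentinel: no more work
            valid.put(None)
            return
        if head_ok(url):
            valid.put(url)

def gatherer(valid, seen, fetched):
    """Stages 2+3: drop duplicates, then fetch (stubbed) each survivor."""
    while True:
        url = valid.get()
        if url is None:
            return
        if url not in seen:          # dedupe before the expensive GET
            seen.add(url)
            fetched.append(url)      # stand-in for the real GET + parse

candidates, valid = queue.Queue(), queue.Queue()
seen, fetched = set(), []

t1 = threading.Thread(target=validator, args=(candidates, valid))
t2 = threading.Thread(target=gatherer, args=(valid, seen, fetched))
t1.start()
t2.start()

for url in ["http://a.example/", "http://a.example/dead/x",
            "http://b.example/", "http://a.example/"]:
    candidates.put(url)
candidates.put(None)
t1.join()
t2.join()

print(fetched)   # live, deduplicated URLs in arrival order
```

Note the problem David points out survives in this sketch: as soon as the gatherer has to follow links found in fetched pages, it must feed URLs back into `candidates`, and the validator becomes part of the spidering loop.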

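On the robots.txt point: Python's standard-library `urllib.robotparser` answers "may this agent fetch this URL, and how fast?" In this sketch the rules are parsed from an inline string so no network access is needed, and the "Plucker/1.0" user-agent string is invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Rules a site might serve at /robots.txt; parsed from a string here
# instead of fetching a live file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Plucker/1.0", "http://example.com/index.html"))   # True
print(rp.can_fetch("Plucker/1.0", "http://example.com/cgi-bin/query"))  # False
print(rp.crawl_delay("Plucker/1.0"))   # 10 (seconds between requests)
```

Honouring the `Crawl-delay` value per server would directly address the conflict above: channels on different servers can still run in parallel, while each individual server is hit no faster than its owner allows.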