> excellent! since it's pure java, it opens the door to cookies too, yes?
> how about multithreading -- essentially speeding up the process..?
There's an interesting problem with "speeding up" Plucker... well,
several problems, but I'll stay at the 40,000-foot level for now.
As you add multiple threads to fetch content, you run the risk of
fetching duplicate copies of the same content. An example is Slashdot's
main page: it links to Freshmeat.net, and Freshmeat's page links back to
Slashdot. With two threads going, one fetching each of those pages, you
risk pulling the same content twice, which wastes both time and bandwidth.
To avoid that, you either need to share state between the threads, which
isn't fun from a programming perspective, or write your fetched data to a
file and provide locking mechanisms around it.
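The shared-state approach can be sketched in a few lines: a set of URLs
already claimed, guarded by a lock so that two fetcher threads can never
both decide to download the same page. This is just an illustration of the
idea, not anything from the Plucker codebase:

```python
import threading

class VisitedSet:
    """Shared record of URLs already claimed, guarded by a lock so
    multiple fetcher threads never download the same page twice."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, url):
        """Return True only for the first thread to claim this URL."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

visited = VisitedSet()
print(visited.claim("http://slashdot.org/"))   # first claim succeeds: True
print(visited.claim("http://slashdot.org/"))   # duplicate refused: False
```

The lock is what makes the check-then-add atomic; without it, two threads
could both see the URL as unvisited and fetch it anyway.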
Ideally, you have one thread testing the specified links (using HEAD,
not GET), up to maxdepth, for validity (are they up? down? inaccessible?
blocked?) and returning a list of valid urls to a second thread, which
sorts out the duplicates and hands the remaining valid links off to a
third "gather" thread, which fetches the content itself. There's a slight
problem here, though: to gather, you have to traverse, so the first
thread, the one testing validity, becomes part of the spidering thread.
I've got a design that works very well here, but it doesn't lend itself to
threading, though it does the same/similar things in a much more rapid
way.
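The validate/dedupe/gather pipeline described above could be wired
together with queues, roughly like this. The HTTP calls are stubbed out
(`head_ok` and `fetch` are placeholders I made up for the sketch; a real
spider would issue actual HEAD and GET requests):

```python
import queue
import threading

def head_ok(url):
    # Placeholder for an HTTP HEAD request: check that the URL is
    # reachable without downloading the body.
    return not url.endswith("/down")

def fetch(url):
    # Placeholder for an HTTP GET of the page body.
    return "<html>%s</html>" % url

def validator(candidates, valid, seen, lock):
    """Drop duplicates, HEAD-test the rest, pass live URLs onward."""
    while True:
        url = candidates.get()
        if url is None:          # sentinel: shut the pipeline down
            valid.put(None)
            break
        with lock:
            if url in seen:
                continue
            seen.add(url)
        if head_ok(url):
            valid.put(url)

def gatherer(valid, results):
    """Fetch the content of each validated URL."""
    while True:
        url = valid.get()
        if url is None:
            break
        results[url] = fetch(url)

candidates, valid = queue.Queue(), queue.Queue()
results, seen, lock = {}, set(), threading.Lock()

threads = [
    threading.Thread(target=validator, args=(candidates, valid, seen, lock)),
    threading.Thread(target=gatherer, args=(valid, results)),
]
for t in threads:
    t.start()
for url in ["http://a/", "http://b/", "http://a/", "http://c/down"]:
    candidates.put(url)
candidates.put(None)
for t in threads:
    t.join()
print(sorted(results))   # duplicate of http://a/ and the dead link are gone
```

Note the coupling the paragraph points out: as soon as the gatherer parses
pages for new links, it has to feed them back into `candidates`, and the
validator is effectively part of the spider loop.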
The other thing, more hurtful to Plucker as a successful project, is
that users always want it to act faster, and Plucker may not always be the
limiting factor. Some websites may shut us off for being too "fast" at
fetching their content. We may also run into a robots.txt that prohibits
fetching more than one page within a certain period of time. Right now, we
ignore robots.txt, but it's my belief that we shouldn't. Ignoring it is
abusive, and we're only going to piss off content owners by not adhering
to the standards that recommend honoring robots.txt.
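Honoring robots.txt is not much code, for what it's worth. Python's
standard library ships a parser; here's a sketch against a made-up
robots.txt (the `Disallow` and `Crawl-delay` values are invented for
illustration):

```python
from urllib.robotparser import RobotFileParser

# A fabricated robots.txt that blocks one directory and asks
# crawlers to pause between requests.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Plucker", "http://example.org/index.html"))  # True
print(rp.can_fetch("Plucker", "http://example.org/private/x"))   # False
print(rp.crawl_delay("Plucker"))  # seconds to wait between fetches: 10
```

In a real spider you'd call `rp.read()` against each site's
`/robots.txt` URL instead of feeding in literal lines, and skip or delay
any URL the rules disallow.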
So therein lies a conflict: users want the content as fast as
possible, while content providers prohibit users from slamming their
servers to get it that fast. So what do you do?
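One possible compromise (my suggestion, not an existing Plucker
feature) is to fetch many hosts in parallel but throttle requests to any
single host, so the user gets speed without any one server being slammed:

```python
import threading
import time
from urllib.parse import urlsplit

class HostThrottle:
    """Enforce a minimum delay between requests to the same host, so
    parallel fetchers stay polite even when users want speed."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._lock = threading.Lock()
        self._next_ok = {}   # host -> earliest allowed fetch time

    def wait(self, url):
        """Block until it is polite to fetch this URL's host."""
        host = urlsplit(url).hostname
        with self._lock:
            now = time.monotonic()
            ready = max(now, self._next_ok.get(host, now))
            self._next_ok[host] = ready + self.min_delay
            pause = ready - now
        if pause > 0:
            time.sleep(pause)

throttle = HostThrottle(min_delay=0.2)
start = time.monotonic()
for url in ["http://example.org/a", "http://example.org/b"]:
    throttle.wait(url)
elapsed = time.monotonic() - start
print("second fetch to same host delayed:", elapsed >= 0.2)
```

Requests to different hosts pass through immediately; only back-to-back
hits on the same host are spaced out, which is roughly what a
Crawl-delay directive asks for.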
d.
perldoc -qa.j | perl -lpe '($_)=m("(.*)")'
_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev