hmmm.  good point.  reminds me of college days and "the most effective
optimisation is not to waste time optimising".

the problems are classic ones.  for a given site, it would be a
difficult task without reworking the entire "plucking" approach to use
multi-threaded spidering.

what we could do is, for a given job - we'll use plucker desktop as an
example - start independent channels in their own thread.

another example is the daily news variety of job - we just set a thread
going on each "channel".

i'd reckon these two above are reasonably easy to detect.
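rough sketch of what i mean, in python since that's what the plucker code is in -- the fetch_channel() helper here is made up, standing in for the real per-channel pluck:

```python
# one worker thread per channel; fetch_channel() is a hypothetical
# stand-in for the real per-channel pluck.
import threading

results = []
results_lock = threading.Lock()

def fetch_channel(name):
    # pretend to pluck the channel; record that we handled it
    with results_lock:
        results.append(name)

channels = ["daily-news-1", "daily-news-2", "daily-news-3"]
threads = [threading.Thread(target=fetch_channel, args=(c,)) for c in channels]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

each channel runs in its own thread, so the slowest server no longer gates the whole job.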

when there's only one thing in progress, like generating an ebook, the
delay isn't such a big deal.  when you have 10-20 daily news channels,
amounting to some 500 pages, it's likely to take a lot less time if not
run serially.  

using this approach, we hammer the servers less, which will lead to
fewer errors.  plucking a couple of different content sources at the
same time should take less time overall.


-kev

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On behalf of David A.
Desrosiers
Sent: 23 September 2002 12:43
To: Plucker Development List
Subject: RE: java flavoured cookies


> excellent!  since it's pure java, it opens the door to cookies too, yes?
> how about multithreading -- essentially speeding up the process..?

        There's an interesting problem with "speeding up" Plucker..
well, several problems, but I'll stay at the 40,000-foot level for now.

        As you add multiple "threads" to fetch content, you run the
risk of fetching duplicate instances of content. An example is
Slashdot's main page: it links to Freshmeat.net. If you also try to
gather Freshmeat's page, it links back to Slashdot. With two threads
going, each fetching one of those pages, you now risk pulling duplicate
content, which wastes both time and bandwidth. To do this properly,
you either need to share memory between the threads, which isn't fun
from a programming perspective, or you write your fetched data to a
file and have to provide locking mechanisms.
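        The shared-memory route could be as small as a lock-guarded
set of claimed URLs -- a sketch, not Plucker code, and the claim()
helper is made up for illustration:

```python
import threading

seen = set()
seen_lock = threading.Lock()

def claim(url):
    # Return True if the caller may fetch url,
    # False if some other thread has already claimed it.
    with seen_lock:
        if url in seen:
            return False
        seen.add(url)
        return True
```

Each fetch thread calls claim() before issuing a GET, so Slashdot's
link to Freshmeat and Freshmeat's link back to Slashdot each get
fetched exactly once.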

        Ideally, you have one thread testing the specified links (using
HEAD, not GET), up to maxdepth, for validity (are they up? down?
inaccessible? blocked?) and returning a list of valid urls to a second
thread, which can then sort out the dupes and begin throwing the valid
links off to a third "gather" thread, which fetches the content itself.
There's a slight problem here though: when you gather, you have to
traverse, and so the first thread, the one needed to test validity,
becomes part of the spidering thread. I've got a design that works
very well here, but it doesn't lend itself to threading, though it
does the same/similar things in a much more rapid way.
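        That validate/dedupe/gather pipeline can be sketched with
queues between the threads -- head_ok() here is a made-up stand-in
for a real HEAD request, and this is an illustration of the shape,
not the actual design:

```python
import queue
import threading

candidates = queue.Queue()  # urls found by the spider
valid = queue.Queue()       # urls that passed the HEAD check
fetched = []                # stands in for the gather thread's output

def head_ok(url):
    # hypothetical stand-in for issuing a real HEAD request
    return "dead" not in url

def validator():
    # first thread: HEAD-test candidates, pass survivors downstream
    while True:
        url = candidates.get()
        if url is None:
            valid.put(None)  # propagate shutdown marker
            break
        if head_ok(url):
            valid.put(url)

def gatherer():
    # downstream thread: sort out the dupes, then "fetch" the content
    seen = set()
    while True:
        url = valid.get()
        if url is None:
            break
        if url not in seen:
            seen.add(url)
            fetched.append(url)

t1 = threading.Thread(target=validator)
t2 = threading.Thread(target=gatherer)
t1.start(); t2.start()
for u in ["http://a/", "http://b/dead", "http://a/", "http://c/"]:
    candidates.put(u)
candidates.put(None)
t1.join(); t2.join()
```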

        The other thing, more hurtful to Plucker as a successful
project, is that users always want it to act faster, and Plucker may
not always be the limiting factor. Some websites may shut us off for
being "fast" at fetching their content. We may also run into a
robots.txt that prohibits fetching more than one page at a time within
a certain period. Right now we ignore robots.txt, but it's my belief
that we shouldn't. It's abusive not to honor it, and we're only going
to piss off content owners by not adhering to the standards that
recommend using robots.txt.
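        Honoring it wouldn't cost much, either: Python's standard
robotparser already does the checking. The rules below are made up
for illustration -- normally you'd point set_url() at the site's
/robots.txt and call read():

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# feed made-up rules directly via parse() for illustration
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
])

allowed = rp.can_fetch("Plucker", "http://example.org/news.html")
blocked = rp.can_fetch("Plucker", "http://example.org/private/x.html")
delay = rp.crawl_delay("Plucker")  # seconds to wait between requests
```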

        So therein lies the conflict: users want the content as fast
as possible, while content providers prohibit them from slamming the
servers to get it that fast. So what do you do?



d.

perldoc -qa.j | perl -lpe '($_)=m("(.*)")'


_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
