Dear all,

I am new to this list, and even though I tried to go through the archives, I
couldn't find this problem mentioned before.

Summary:
-------

Plucker Desktop has a link-traversal/MAXDEPTH problem that 1.1.13 does not
exhibit.

Possible cause/Suggested solution: [for developers, really... :-]
---------------------------------

This seems to be caused by traversing links depth first instead of breadth
first.  Using a FIFO queue instead of a LIFO stack for parsed URLs could solve
the problem.  But I haven't looked at the source - I'm not really comfortable
with Python ;-)
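
For illustration only, here is a minimal sketch of the FIFO idea.  It is not
Plucker's actual code: the function name, the in-memory "site" dictionary, and
the page names are all made up for the example, and no real fetching is done.
Because a queue hands pages out in the order they were discovered, every page
is first reached at its shallowest level, so only level decides what gets
followed.

```python
from collections import deque


def crawl_breadth_first(links, start, max_depth):
    """Return page names in fetch order, visiting level by level.

    `links` maps a page name to the list of pages it links to
    (a hypothetical stand-in for fetching and parsing a real page).
    """
    queue = deque([(start, 1)])   # FIFO: oldest discovered page first
    seen = {start}                # pages already queued or fetched
    fetched = []
    while queue:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth < max_depth:     # only follow links above the depth limit
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return fetched


# Hypothetical site: "a" is linked from "home" (so it sits at level 2)
# and also from "b" (where it would sit at level 3).
site = {
    "home": ["a", "b"],
    "a": ["a1"],
    "b": ["b1", "a"],
}

order = crawl_breadth_first(site, "home", max_depth=3)
print(order)  # ['home', 'a', 'b', 'a1', 'b1']
```

Here "a" is always handled at level 2, so its link "a1" is followed no matter
in which order the links appear on "home".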

Detail:
------

Say MAXDEPTH=3.  A link appears on one page at level 2, but *also* (through a
different path) on another page on level 3.  If Plucker arrives at the "level
2" page before the "level 3" page, it will fetch that link.  If, however, it
arrives at the "level 3" page first, it does not fetch it.

In other words, order matters in determining what gets fetched, not just
level.

Simple example:
--------------

http://www.wired.com/news_drop/palmpilot/index.html is a low bandwidth site.
The main (level 1) page has 5 links (Top Stories, Business, Culture,
Technology, and Politics - I will abbreviate them as TS, B, C, T, and P in my
description below).

Each "level 2" page has 1 or more article summaries, each of which links to
the corresponding full article.  The problem is that Plucker will fetch full
articles *only* for the Politics page, no others!

Take *special note* of the bottom "nav bar" on each of the 5 level 2 pages -
the same links (TS, B, C, T, and P) appear there.  They are part of the
problem :-)

Here's what Plucker does (remember MAXDEPTH = 3):

--> depth 1 fetch MAIN page
    save 5 links TS, B, C, T, P
--> depth 2 fetch P (Politics)
    [Plucker appears to put them in LIFO order, so P gets picked up first]
    push URLs for article headers to stack
    push URLs for bottom "nav bar" (same TS, B, C, T, and P) to stack
--> depth 3 fetch pages (TS, B, C, T) from nav bar links stored just now
    [they are actually fetched in the order T, C, B, TS]
    DO NOT recurse, because you are already at level 3
--> depth 3 fetch articles for article headers parsed from the "P" page
    (now done with P page)
--> depth 2 look at stack and see "T" as the next entry
    DISCARD it because it's already been parsed
    PROBLEM: article bodies for article headers in T never get fetched.

I hope that makes sense.  The end result is that I can see detailed articles
ONLY for the Politics page.  None of the others.
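
The trace above can be reproduced with a small model.  This is only a sketch
of the behaviour as observed from the outside, under the assumption that
parsed URLs go onto a LIFO stack and that a page is skipped once it has been
fetched; it is not taken from the Plucker source.  The page names ("MAIN",
"P-article", and so on) are placeholders, with one article per section page.

```python
def crawl_lifo(links, start, max_depth):
    """Return page names in fetch order when pending URLs sit on a stack."""
    stack = [(start, 1)]          # LIFO: most recently pushed page first
    seen = set()                  # pages already fetched
    fetched = []
    while stack:
        url, depth = stack.pop()
        if url in seen:
            continue              # discarded: already fetched at some depth
        seen.add(url)
        fetched.append(url)
        if depth < max_depth:     # pages at the depth limit are not parsed
            for target in links.get(url, []):
                stack.append((target, depth + 1))
    return fetched


# Assumed model of the site: the main page links to the five sections;
# each section links to its article first, then to the nav bar.
SECTIONS = ["TS", "B", "C", "T", "P"]
links = {"MAIN": list(SECTIONS)}
for name in SECTIONS:
    links[name] = [name + "-article"] + SECTIONS

fetched = crawl_lifo(links, "MAIN", max_depth=3)
print(fetched)
# ['MAIN', 'P', 'T', 'C', 'B', 'TS', 'P-article']
```

In this model T, C, B, and TS are first reached at depth 3 through the nav bar
on the P page, so their articles are never queued, and the depth 2 entries
for them are discarded later as already fetched.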

If my description of the problem is unclear, please go to
http://www.wired.com/news_drop/palmpilot/index.html and browse around a bit,
then try to download it via the Plucker Desktop at MAXDEPTH=3.

Thanks for a great product!

Sitaram