There seem to be two slightly different approaches to eliminating duplicate downloads at present.
As far as I can see, the Regex parser ignores duplicate URLs before applying the base URL, whereas the JTidy and HtmlParser code ignores duplicates after applying the base URL. [The Regex parser also takes note of BASE tags; neither of the others does at present - this should probably be fixed.] I don't know which approach to duplicates is more appropriate, but it seems to me that the same algorithm should be used for both?

Now, presumably the purpose of ignoring the duplicate URLs is to emulate browser caching? In which case, should we take note of Cache-Control and Expires headers? If so, we would need to use something other than a Set to hold the URLs - but an Iterator would still be valid.

Perhaps the Parsers should return everything they find, and then there could be a separate stage to eliminate duplicates?
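Something along the lines of the sketch below is roughly what I have in mind for that separate stage. It is only an illustration - the class and method names are invented, not existing parser code - but it shows the URLs being resolved against the base before the duplicate check, so that two different relative strings pointing at the same absolute URL collapse into one entry.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: a post-parse step that resolves each URL string found by a
// parser against the page's base URL (or the BASE tag value, if present)
// and drops duplicates, returning an Iterator in document order.
public class UrlDeduplicator {

    public static Iterator<URL> resolveAndDeduplicate(Iterator<String> rawUrls, URL baseUrl)
            throws MalformedURLException {
        // Dedupe on the external form rather than on URL itself, since
        // URL.equals() can trigger DNS lookups.
        Set<String> seen = new LinkedHashSet<String>();
        List<URL> result = new ArrayList<URL>();
        while (rawUrls.hasNext()) {
            URL resolved = new URL(baseUrl, rawUrls.next());
            if (seen.add(resolved.toExternalForm())) {
                result.add(resolved);
            }
        }
        return result.iterator();
    }
}

If we later wanted to honour Cache-Control/Expires, the Set could become a Map from URL to expiry time without changing the callers, since they would still just get an Iterator back.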
--
Sebastian Bazley