There seem to be two slightly different approaches to eliminating duplicate downloads at present.
As far as I can see, the Regex parser ignores duplicate URLs before applying the base URL, whereas the JTidy and HtmlParser code ignores duplicates after applying the base URL. [The Regex parser also takes note of BASE tags; neither of the others does at present - this should probably be fixed.] I don't know which approach to duplicates is more appropriate, but it seems to me that the same algorithm should be used for both?

Now, presumably the purpose of ignoring the duplicate URLs is to emulate browser caching? In which case, should we take note of Cache-Control and Expires headers? If so, we would need to use something other than a Set to hold the URLs - but an Iterator would still be valid.

Perhaps the Parsers should return everything they find, and then there could be a separate stage to eliminate duplicates?
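Something along the lines of the sketch below is roughly what I have in mind for that separate stage. It is only an illustration - the class and method names are invented, not existing parser code - but it shows the URLs being resolved against the base before the duplicate check, so that two different relative strings pointing at the same absolute URL collapse into one entry.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: a post-parse step that resolves each URL string found by a
// parser against the page's base URL (or the BASE tag value, if present)
// and drops duplicates, returning an Iterator in document order.
public class UrlDeduplicator {

    public static Iterator<URL> resolveAndDeduplicate(Iterator<String> rawUrls, URL baseUrl)
            throws MalformedURLException {
        // Dedupe on the external form rather than on URL itself, since
        // URL.equals() can trigger DNS lookups.
        Set<String> seen = new LinkedHashSet<String>();
        List<URL> result = new ArrayList<URL>();
        while (rawUrls.hasNext()) {
            URL resolved = new URL(baseUrl, rawUrls.next());
            if (seen.add(resolved.toExternalForm())) {
                result.add(resolved);
            }
        }
        return result.iterator();
    }
}

If we later wanted to honour Cache-Control/Expires, the Set could become a Map from URL to expiry time without changing the callers, since they would still just get an Iterator back.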
--
Sebastian Bazley