BAZLEY, Sebastian wrote:
There seem to be two slightly different approaches to eliminating duplicate
downloads at present.

As far as I can see, the Regex parser ignores duplicate URLs before applying
the base URL, whereas the JTidy and HtmlParser code ignores duplicates after
applying the base URL.

Yes, this was probably a bug in the regexp parser. It has been corrected.


[The regex parser also takes note of BASE tags;
neither of the others does at present - this should probably be fixed.]

Indeed.



I don't know which approach to duplicates is more appropriate, but it seems to me that the same algorithm should be used for both?


Eliminating them with the base applied is definitely better.
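
For illustration, a minimal sketch of how that could look (the class and method names here are made up, this is not the actual parser code):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class LinkCollector {

    // Resolve each extracted link against the base URL first, then drop
    // duplicates, so that a relative and an absolute reference to the same
    // resource collapse into a single download.
    public static Iterator collectLinks(String[] rawLinks, URL baseUrl)
            throws MalformedURLException {
        Set resolved = new LinkedHashSet(); // keeps document order, ignores repeats
        for (int i = 0; i < rawLinks.length; i++) {
            resolved.add(new URL(baseUrl, rawLinks[i]));
        }
        return resolved.iterator();
    }
}

If the page contains a BASE tag, baseUrl would be taken from it instead of from the page's own URL.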


Now, presumably the purpose of ignoring the duplicate URLs is to emulate
browser caching?

Yes.


In which case, should we take note of Cache and Expires headers?

Of course, but we're not quite there yet...
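
For example, the standard java.net API already exposes the Expires header; a rough sketch (nothing like this exists in JMeter yet):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ExpiresExample {

    // Fetch a resource and return the time (millis since the epoch) at which
    // the server says it expires; 0 means no Expires header was sent.
    public static long expiryTimeOf(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            return conn.getExpiration(); // parses the Expires response header
        } finally {
            conn.disconnect();
        }
    }
}

Honoring that value, rather than just suppressing every repeated URL, is what would turn the duplicate check into real cache emulation.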


If so, we would need to use something other than a Set to hold the URLs -
but an Iterator would still be valid.

I'd say we should have a component similar to the Cookie Manager --call it a Browser Cache-- keeping track of which images are in the cache and their status. The job of checking the cache does not belong to the parser.
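
Something roughly like this, perhaps (names and structure purely illustrative, not existing code):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical "Browser Cache" element, analogous to the Cookie Manager:
// it holds the per-thread cache state, and the sampler (not the parser)
// asks it whether an embedded resource needs to be fetched again.
public class BrowserCache implements Serializable {

    // url -> expiry time in milliseconds, as recorded from the response headers
    private final Map entries = new HashMap();

    // Record a downloaded resource and when it expires (0 = expires immediately).
    public void addEntry(String url, long expiresAt) {
        entries.put(url, new Long(expiresAt));
    }

    // True if the resource is not cached, or its Expires time has passed.
    public boolean mustDownload(String url) {
        Long expires = (Long) entries.get(url);
        return expires == null || expires.longValue() <= System.currentTimeMillis();
    }

    // Reset between iterations, as the Cookie Manager does when told to clear.
    public void clear() {
        entries.clear();
    }
}

The samplers would consult mustDownload() before fetching an embedded resource and call addEntry() afterwards, much as they consult the Cookie Manager for cookies today.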


Perhaps the Parsers should return everything they find, and then there could
be a separate stage to eliminate duplicates?

Could be. Works nicely as it is now, though.
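
If we ever went that way, the extra stage could be as simple as this (again, hypothetical names):

import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class DuplicateFilter {

    // Takes everything the parser found, in document order, and returns the
    // same URLs with repeats removed, so that parsing and duplicate
    // elimination stay two independent steps.
    public static Iterator withoutDuplicates(Iterator allUrls) {
        Set seen = new LinkedHashSet();
        while (allUrls.hasNext()) {
            seen.add(allUrls.next()); // the Set silently drops repeats
        }
        return seen.iterator();
    }
}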


--
Cheers,

Jordi.

