Are ETags worth looking at?
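
A minimal sketch of the idea, in Python purely for illustration. It assumes the server sent an ETag header on an earlier fetch and actually honors If-None-Match, which many dynamic sites may not. The helper names conditional_request and fetch_if_changed are made up for this example.

```python
import urllib.error
import urllib.request


def conditional_request(url, etag=None, last_modified=None):
    """Build a GET that lets the server answer 304 if nothing changed."""
    req = urllib.request.Request(url)
    if etag:
        # Validator saved from the ETag header of the previous response.
        req.add_header("If-None-Match", etag)
    if last_modified:
        # Fall back to (or combine with) the date-based validator.
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (changed, body, etag, last_modified) for one refetch."""
    req = conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return (True, resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports "304 Not Modified" as an HTTPError.
            return (False, None, etag, last_modified)
        raise
```

If a site never returns an ETag, or returns a new one on every request, this gains nothing over a plain refetch.
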
Sean // [EMAIL PROTECTED]
Bob Worthy wrote:
>
> Weiguo, yes, I've had the same problem with my robot. Randal Schwartz
> speaks the truth about missing data, I'm sure. It sounds like you and I
> are doing much the same thing.
>
> The problem is actually worse than simply not having the last modified
> date available. Many sites lie about the last modified date in the sense
> that the last modified date is always the date-time the page was requested
> from the site. I went to a scheme of trying to determine page changes by
> looking at the size of the page compared to when I last requested it.
> After all, when you fetch a page you can count the bytes yourself. But
> this often fails too, because a great many pages have useless dynamic
> elements that, for instance, display the current date on the page. Since
> this changes the length of the page virtually every day, the page length
> is also a useless indicator of whether a page has changed.
>
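
One possible refinement of the byte-count approach, sketched in Python: hash the body after masking text that looks like a date or time, so a "today is ..." banner no longer counts as a change. The patterns below are assumptions and deliberately crude (they will also mask words such as "may 5" or "sun"), and fingerprint is a hypothetical helper name.

```python
import hashlib
import re

_MONTH = (rb"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
          rb"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)")
_DAY = rb"(?:mon|tues?|wed(?:nes)?|thu(?:rs)?|fri|sat(?:ur)?|sun)(?:day)?"

# Assumed shapes of "current date" noise; extend for the sites you crawl.
VOLATILE = re.compile(b"|".join([
    rb"\b" + _DAY + rb"\b,?",                                  # Tuesday,
    rb"\b" + _MONTH + rb"\b\.?\s+\d{1,2}(?:,?\s+\d{4})?",      # Feb 29, 2000
    rb"\d{1,2}\s+" + _MONTH + rb"\b\.?(?:\s+\d{4})?",          # 29 Feb 2000
    rb"\d{1,4}[-/]\d{1,2}[-/]\d{1,4}",                         # 2000-02-29
    rb"\d{1,2}:\d{2}(?::\d{2})?(?:\s?[ap]m)?",                 # 10:15:02 pm
]), re.IGNORECASE)


def fingerprint(body: bytes) -> str:
    """Hash the page with date-like text masked and whitespace collapsed."""
    masked = VOLATILE.sub(b"<t>", body)
    masked = re.sub(rb"\s+", b" ", masked).strip()
    return hashlib.sha1(masked).hexdigest()


def has_changed(previous_fingerprint: str, body: bytes) -> bool:
    """Compare a stored fingerprint with the freshly fetched body."""
    return previous_fingerprint != fingerprint(body)
```

Pages that rotate ads or counters would need further masking; this only addresses the displayed-date case described above.
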
> On the plus side, it is usually easy to determine when a site is playing
> these (to me anyway) stupid games. Those pages I simply treat as
> unchanged for some default period. It isn't a good solution, but it keeps
> the cheating sites from pushing their pages to the front every time.
>
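
The "default period" policy might look roughly like the Python sketch below. It assumes a site is "cheating" when Last-Modified is absent or sits within a few seconds of the response's own Date header; that heuristic, the 5-second tolerance, the 7-day hold, and both function names are assumptions, not anything stated in this thread.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime


def last_modified_is_suspect(headers, tolerance=timedelta(seconds=5)):
    """True if Last-Modified is missing, unparseable, or mirrors Date."""
    last_modified = headers.get("Last-Modified")
    date = headers.get("Date")
    if not last_modified:
        return True
    if not date:
        return False  # nothing to compare against, so give it the benefit
    try:
        gap = abs(parsedate_to_datetime(date)
                  - parsedate_to_datetime(last_modified))
    except (TypeError, ValueError):
        return True  # malformed or mixed naive/aware dates
    return gap <= tolerance


def due_for_refetch(last_fetch: datetime, now: datetime, suspect: bool,
                    hold=timedelta(days=7)):
    """Suspect pages are treated as unchanged until the hold period ends."""
    if suspect:
        return now - last_fetch >= hold
    return True  # trustworthy pages go through the normal change check
```

Here headers can be any mapping with a get method, such as the headers object of an HTTP response.
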
> If anyone has better ideas I'd LOVE to hear them.
>
> I guess this is a bit off-topic, and I apologize for that.
>
> On 29 Feb 2000, Randal L. Schwartz wrote:
>
> > >>>>> "Weiguo" == Weiguo Fan <[EMAIL PROTECTED]> writes:
> >
> > Weiguo> I am trying to get the last_modified_date for all the
> > Weiguo> URLs. But I found not every web server supports this header
> > Weiguo> information. Is there any way that I can calculate or get the
> > Weiguo> modification date?
> >
> > If the information is not in the response, it doesn't have a "last
> > modification date". Period. You can't compute something that doesn't
> > exist. Or rather, any such computation would be a lie and misleading.
> >
> > You're probably seeing a bunch of dynamic pages, since dynamic pages
> > with SSI tend not to have a last-modified, since the last-modified is
> > always "whenever you just asked for it". Most of the pages on
> > www.stonehenge.com are like that, for example. "last-modified" makes
> > sense only for static data.
> >
> > If you intend to call "if-modified-since" on refetching those URLs,
> > you're likely never to get a "304" response, so not knowing the
> > last-modified really doesn't make any difference.