Recently, in the search app we are working on, we've encountered a lot of 
websites that send a wrong or misleading date in the Last-Modified HTTP 
header. For instance, an article posted on a news site back in 2010 has a 
Last-Modified header of just a few days ago. This could be for any number of 
reasons:

- A new comment was added to the site
- Some cache invalidation occurring in the source code of the website that 
affects the article's page
- Perhaps a new ad showing in the sidebar
- Or just plain wrong header handling in the platform code

From what I've seen, several CMSs handle this themselves, some even allowing 
the published date to be "tweaked". My question is basically whether anyone 
on the list has a suggestion on how to address this situation. In the 
particular case we've been working on, most of the URLs contain the 
published date in the form yyyy/mm/dd (or something similar), so this could 
be one way of "guessing" the publication date of the article. I realize that 
this is no silver bullet, but I'd love to get some feedback on this type of 
situation. In my experience, when people filter by date in our frontend app, 
they are usually trying to get news/articles by publication date rather than 
by Last-Modified date, and they are confused when the returned results have 
very old publication dates. They usually don't check whether the page 
changed because of, for instance, a new comment.
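
As a rough illustration of the URL-based guess, a minimal Java sketch could 
look like the one below. The class name UrlDateGuesser and the method guess 
are hypothetical, not existing Nutch code, and the pattern assumes a plain 
/yyyy/mm/dd path segment with a year starting in 19 or 20; other layouts 
would need their own patterns.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): guesses a publication date
// from a /yyyy/mm/dd segment in the URL.
final class UrlDateGuesser {

    // Assumption: the year starts with 19 or 20, and the day is not
    // followed by another digit (so /2010/05/170 is rejected).
    private static final Pattern DATE_IN_URL =
        Pattern.compile("/((?:19|20)\\d{2})/(\\d{1,2})/(\\d{1,2})(?!\\d)");

    static Optional<LocalDate> guess(String url) {
        Matcher m = DATE_IN_URL.matcher(url);
        while (m.find()) {
            try {
                return Optional.of(LocalDate.of(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3))));
            } catch (DateTimeException e) {
                // Looks like a date but is not one (e.g. month 13);
                // keep scanning the rest of the URL.
            }
        }
        return Optional.empty();
    }
}
```

A caller could then prefer the guessed date when one is present and fall 
back to the Last-Modified header otherwise, though whether that ordering is 
right for every site is exactly the open question.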

I'm leaving the "how to implement this" aside for now; I'm just interested in 
discussing how to deal with this type of situation. As stated, in our 
particular case we can rely on the URL patterns for a very good portion of 
the URLs, but I was hoping to agree on some general approach that could be 
integrated into Nutch.

Regards,

PS: Should I also post this to the user list?
