RE: Handling servers with wrong Last Modified HTTP header

Markus Jelsma Wed, 11 Mar 2015 14:14:22 -0700

Hello Jorge,

This is an interesting but very complicated issue. First of all, do not rely on 
HTTP headers; they are incorrect at any scale larger than very small. This is 
true for Last-Modified because of dynamic CMSs, but also for many other 
headers. You can even expect website descriptions in headers such as 
Content-Type, madness!


The only reliable source of a document's date and optionally time is within the 
document itself. This introduces two new problems: 1) what format and language 
the date is written in, and 2) where exactly you can find it. Let's discuss 
these two issues.

The first is the most straightforward to deal with; it is a two-stage process. 
First you need to extract anything that resembles a date format that is used on 
Earth, including non-numeric dates such as those with month names. Then you 
have to pass all those date candidates through a series of carefully ordered 
date formats (SimpleDateFormat) with the appropriate Locale set. This stage 
requires that you have identified the language of the document, or of the part 
of the document you are processing in the case of multi-language documents.
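
To make the two stages concrete, here is a minimal Java sketch. It is not the 
NUTCH-1414 code: the class name DateCandidates, the candidate regex and the 
short pattern list are all assumptions made up for illustration, and a real 
list of formats would be much longer and tuned per language.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not the NUTCH-1414 code.
final class DateCandidates {

  // Stage 1: a deliberately loose pattern for anything that looks like a date.
  private static final Pattern CANDIDATE = Pattern.compile(
      "\\b(?:\\d{4}-\\d{1,2}-\\d{1,2}"            // 2015-03-11
      + "|\\d{1,2}[./-]\\d{1,2}[./-]\\d{4}"        // 11-03-2015, 11/03/2015
      + "|\\d{1,2}\\.?\\s+\\p{L}+\\s+\\d{4}"       // 11 March 2015
      + "|\\p{L}+\\s+\\d{1,2},?\\s+\\d{4})\\b",    // March 11, 2015
      Pattern.UNICODE_CHARACTER_CLASS);

  // Stage 2: formats are tried in this order and the first full match wins.
  // Day-before-month is assumed for numeric dates; that is a guess that
  // should really depend on the language or region of the document.
  private static final String[] PATTERNS = {
      "yyyy-MM-dd", "d MMMM yyyy", "d. MMMM yyyy", "MMMM d, yyyy",
      "d-M-yyyy", "d/M/yyyy", "d.M.yyyy"
  };

  // Returns every candidate in the text that parses as a date for the locale.
  static List<Date> extract(String text, Locale locale) {
    List<Date> dates = new ArrayList<>();
    Matcher matcher = CANDIDATE.matcher(text);
    while (matcher.find()) {
      // Collapse line breaks and repeated spaces inside the candidate.
      String candidate = matcher.group().replaceAll("(?U)\\s+", " ");
      Date parsed = parse(candidate, locale);
      if (parsed != null) {
        dates.add(parsed);
      }
    }
    return dates;
  }

  // Returns the parsed date, or null if no pattern consumes the whole string.
  static Date parse(String candidate, Locale locale) {
    for (String pattern : PATTERNS) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
      format.setLenient(false); // reject impossible dates such as day 45
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      ParsePosition position = new ParsePosition(0);
      Date date = format.parse(candidate, position);
      if (date != null && position.getIndex() == candidate.length()) {
        return date;
      }
    }
    return null;
  }
}
```

Calling DateCandidates.extract(text, Locale.ENGLISH) returns the dates found 
in an English fragment; passing the Locale of the detected language instead 
lets the same patterns match that language's month names. Candidates that 
only look like dates, such as a number followed by an ordinary word and a 
year, fail every format and are dropped.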

Luckily, I uploaded preliminary work as a Nutch parse plugin a few years ago 
that does exactly this; check out NUTCH-1414 [1]. You present the extractor 
with a language and a piece of text, in this case the document's extracted 
text. It is very basic and has many flaws, but it should work nicely if you 
present it with concise fragments of text.

The second part of the solution is more cumbersome to deal with. NUTCH-1414 
uses the document's extracted text as the source for date extraction, and it 
really has no clue as to where the date is located in the document's 
structure. If you use Nutch's basic text extraction (extract all TEXT nodes) 
you will get bad results for most documents. This can be partially solved by 
relying on Boilerpipe's text extraction. But using Boilerpipe may in turn 
prevent you from extracting dates that would have been extracted with no text 
extraction algorithm at all!

Please check out NUTCH-1414 and see if it works for you. Hopefully, in your 
case, it will do what you want it to do. I decided a few years ago to move 
the improved date extraction tool to a separate project, get rid of Boilerpipe 
altogether, and build a new tool from scratch that can interface with a date 
extraction tool and supports looking up the exact spot of the document's 
date. It works on 95% of the many hundreds of real web page tests, so if you 
need something that works at scale, you can contact me off list; the stuff 
has not been open sourced.

Have fun!
Markus

[1]: https://issues.apache.org/jira/browse/NUTCH-1414
 
-----Original message-----
> From:Jorge Luis Betancourt González <[email protected]>
> Sent: Tuesday 10th March 2015 4:23
> To: [email protected]
> Subject: Handling servers with wrong Last Modified HTTP header
> 
> Recently in the search app we are working on we've encountered a lot of 
> websites that have a wrong or invalid date in the Last-Modified HTTP header, 
> meaning for instance that an article posted on a news site back in 2010 has a 
> Last-Modified header of just a few days back. This could be for any number of 
> reasons:
> 
> - A new comment was added to the site
> - Some cache invalidation occurring in the source code of the website that 
> affects the article's page
> - Perhaps a new ad showing in the sidebar
> - Or just plain wrong header handling in the platform code
> 
> From what I've seen this is handled by several CMSs, some even allowing you 
> to "tweak" the published date. My question is basically whether anyone on 
> the list has a suggestion on how to tackle or address this situation. For 
> the particular case that we've been working on, most of the URLs have the 
> published date in the URL in the form of yyyy/mm/dd (or something similar), 
> so this could be one way of "guessing" the publication date of the article. 
> I realize that this is no silver bullet, but I'd love to get some feedback 
> on this type of situation. From my experience, when people filter by date 
> in our frontend app they are usually trying to get news/articles by the 
> publication date instead of the Last-Modified date, and they are confused 
> when the returned results have very old publication dates; they usually 
> don't check whether it is a new comment, for instance.
> 
> I'm leaving the "how to implement this" aside for now; I'm just interested 
> in discussing how to deal with this type of situation. As stated, in our 
> particular case we can rely on the URL patterns for a very good portion, but 
> I was hoping to agree on some general approach that could be integrated in 
> Nutch.
> 
> Regards,
> 
> PS: Should I post this also to the user list? 
>
