Hello Jorge,

This is an interesting but very complicated issue. First of all, do not rely on HTTP headers; at anything beyond a very small scale they are frequently incorrect. This is true for Last-Modified, because of dynamic CMSs, but also for many other headers. You can even expect to find website descriptions in headers such as Content-Type. Madness!
The only reliable source of a document's date, and optionally time, is the document itself. This introduces two new problems: 1) what format and language the date is in, and 2) where exactly you can find it. Let's discuss these two issues.

The first is the most straightforward to deal with; it is a two-stage process. First you need to extract anything that resembles a date format that is used on Earth, including non-numeric dates such as those with month names. Then you have to pass all those date candidates through a series of carefully aligned date formats (SimpleDateFormat), each set to the appropriate Locale. This stage requires that you have identified the language of the document, or of the part of the document you are processing in the case of multi-language documents.

Luckily, I uploaded preliminary work as a Nutch parse plugin a few years ago that does exactly this; check out NUTCH-1414 [1]. You present the extractor with a language and a piece of text, in this case the document's extracted text. It is very basic and has many flaws, but it should work nicely if you present it with concise fragments of text.

The second part of the solution is more cumbersome to deal with. NUTCH-1414 uses the document's extracted text as the source for date extraction, and it really has no clue as to where the date is located in the document's structure. If you use Nutch's basic text extraction (extract all TEXT nodes) you will get bad results for most documents. This can be partially solved by relying on Boilerpipe's text extraction. But Boilerpipe may in turn strip out dates that would have been found had no text extraction algorithm been applied at all!

Please check out NUTCH-1414 and see if it works for you. Hopefully, in your case, it will do what you want it to do.
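To make the two-stage process described above a bit more concrete, here is a minimal sketch in Java. It is not the NUTCH-1414 code: the class name DateCandidates, the candidate regex and the four format patterns are all assumptions chosen for illustration, and a realistic list of formats would be far longer. It also assumes the language has already been identified and is passed in as a Locale.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateCandidates {

    // Stage 1: loose patterns for anything that looks like a date.
    // Hypothetical and deliberately short; real-world text needs many more.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b\\d{4}-\\d{1,2}-\\d{1,2}\\b"        // 2015-03-10
        + "|\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"     // 10/03/2015
        + "|\\b\\d{1,2} \\p{L}+ \\d{4}\\b"      // 10 March 2015
        + "|\\b\\p{L}+ \\d{1,2}, \\d{4}\\b",    // March 10, 2015
        Pattern.UNICODE_CHARACTER_CLASS);

    // Stage 2: formats tried in order; the first one that consumes the
    // whole candidate wins. The order is an assumption, not a recommendation.
    private static final String[] FORMATS = {
        "yyyy-MM-dd", "dd/MM/yyyy", "d MMMM yyyy", "MMMM d, yyyy"
    };

    /** Parses one candidate with the given locale, or returns null. */
    public static Date parse(String candidate, Locale locale) {
        for (String pattern : FORMATS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
            format.setLenient(false); // reject impossible dates such as 99/99/2015
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(candidate, pos);
            if (date != null && pos.getIndex() == candidate.length()) {
                return date;
            }
        }
        return null;
    }

    /** Runs both stages over a fragment of text in a known language. */
    public static List<Date> extract(String text, Locale locale) {
        List<Date> dates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            Date date = parse(matcher.group(), locale);
            if (date != null) {
                dates.add(date);
            }
        }
        return dates;
    }
}
```

Calling DateCandidates.extract("Posted on March 10, 2015 by admin", Locale.ENGLISH) should yield a single date. Passing Locale.forLanguageTag("nl") instead lets the same "d MMMM yyyy" pattern accept Dutch month names, provided the JVM ships that locale data. Note that the sketch says nothing about which of several dates in a page is the publication date; that is the second, harder problem.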
A few years ago I decided to move the improved date extraction tool to a separate project, get rid of Boilerpipe altogether, and build a new tool from scratch that can interface with a date extraction tool and has support for looking up the exact spot of the document's date. It works on 95% of the many hundreds of real web page tests, so if you need something that works at scale you can contact me off list; the stuff has not been open sourced.

Have fun!
Markus

[1]: https://issues.apache.org/jira/browse/NUTCH-1414

-----Original message-----
> From:Jorge Luis Betancourt González <[email protected]>
> Sent: Tuesday 10th March 2015 4:23
> To: [email protected]
> Subject: Handling servers with wrong Last Modified HTTP header
>
> Recently in the search app we are working on we've encountered a lot of
> websites that have a wrong and invalid date in the Last Modified HTTP header,
> meaning for instance that an article posted on a news site back in 2010 has a
> Last Modified header of just a few days back. This could be for any number of
> reasons:
>
> - A new comment was added to the site
> - Some cache invalidation occurring in the source code of the website that
>   affects the article's page
> - Perhaps a new ad showing in the sidebar
> - Or just plain wrong header handling in the platform code
>
> From what I've seen this is handled by several CMSs, some even allowing you
> to "tweak" the published date. My question is basically whether anyone on the
> list has a suggestion on how to tackle or address this situation. For the
> particular case that we've been working on, most of the URLs have the
> published date in the URL in the form of yyyy/mm/dd (or some similar
> fashion), so this could be one way of "guessing" the publication date of the
> article. I realize that this is no silver bullet but I'd love to get some
> feedback on this type of situation.
> From my experience, when people filter by date in our frontend app they
> usually are trying to get news/articles by the publication date instead of
> the Last Modified date, and they are confused when the returned results have
> very old publication dates; they usually don't check whether it is a new
> comment, for instance.
>
> I'm leaving the "how to implement this" aside for now, just interested in
> discussing how to deal with this type of situation. As stated, in our
> particular case we can rely on the URL patterns for a very good portion, but
> I was hoping to agree on some general approach that could be integrated in
> Nutch.
>
> Regards,
>
> PS: Should I post this also to the user list?

