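Isn't he in fact NOT using the US date notation?  AFAIK, the US date notation
is mm/dd/yyyy, and read that way the page would have changed (10/01 = October 1)
after the second crawl (01/02 = January 2).  The sequence only makes sense as
dd/mm/yyyy.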
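
A quick way to check which notation keeps the three events in order. This is
only a sketch using Python's standard datetime module; the helper name
is_chronological is made up for illustration.

```python
from datetime import datetime

# The three dates from the question, in the order the events are described:
# first crawl, page change, second crawl.
events = ["01/01/2008", "10/01/2008", "01/02/2008"]

def is_chronological(dates, fmt):
    """Return True if the dates are already in ascending order under fmt."""
    parsed = [datetime.strptime(d, fmt) for d in dates]
    return parsed == sorted(parsed)

print(is_chronological(events, "%d/%m/%Y"))  # day-first reading   -> True
print(is_chronological(events, "%m/%d/%Y"))  # US month-first read -> False
```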

Russ
------Original Message------
From: Andrzej Bialecki
To: [email protected]
ReplyTo: [email protected]
Sent: Sep 18, 2008 11:18 AM
Subject: Re: Dedup

David Jashi wrote:
> Hello, colleagues.
> 
> I have a theoretical question - let's say
> on 01/01/2008 we have crawled page http://www.site.com/page.html
> on 10/01/2008 the page changed
> on 01/02/2008 we crawled it once again and merged old and new indexes
> 
> which version of this page will Nutch dedup leave in the index?

If we assume that you're using the US date notation (how quaint ;) ), 
then yes - Dedup always keeps the latest version of the page with the 
same url, and discards all other versions.
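
The rule above (keep the latest version per URL, drop the rest) can be
sketched roughly as follows. This is an illustration only, not Nutch's actual
code: the record layout (url, fetch_time, version) and the function name
dedup_keep_latest are hypothetical, and the dates assume the day-first
reading of the question (1 Jan 2008 and 1 Feb 2008).

```python
from datetime import datetime

def dedup_keep_latest(records):
    """Keep one record per URL: the one with the most recent fetch_time."""
    latest = {}
    for rec in records:
        current = latest.get(rec["url"])
        # Replace the stored record only if this one was fetched later.
        if current is None or rec["fetch_time"] > current["fetch_time"]:
            latest[rec["url"]] = rec
    return list(latest.values())

# Hypothetical merged index holding both crawls of the same page.
records = [
    {"url": "http://www.site.com/page.html",
     "fetch_time": datetime(2008, 1, 1), "version": "old"},
    {"url": "http://www.site.com/page.html",
     "fetch_time": datetime(2008, 2, 1), "version": "new"},
]

kept = dedup_keep_latest(records)
print(len(kept), kept[0]["version"])  # 1 new
```

With these two records, only the copy from the second crawl survives, which
matches the answer to the question.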


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

