Unfortunately, this can't be done with a single API call. You have to write some code yourself. The "CrawlDatum" class has a field named "signature", and its value is stored permanently in the crawldb. You get a new signature whenever the newly fetched page content differs from last time. So every time you update the crawldb from newly fetched segments, you can check whether the signature in the segment's CrawlDatum is the same as the old one in the crawldb. You can begin with "CrawlDb.java" to understand the updatedb process, and also take a look at NUTCH-61.
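To make the comparison concrete, here is a minimal standalone sketch. It does not use any Nutch classes: SignatureCheck, signature() and hasChanged() are hypothetical names made up for illustration, and the MD5 digest of the raw bytes only stands in for whatever signature Nutch actually stores. Treating a missing old signature as "changed" is also just an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not part of Nutch: it only mimics the idea of
// comparing a stored signature with a freshly computed one.
final class SignatureCheck {

    // MD5 digest of the raw page bytes, standing in for the page signature.
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumption of this sketch: a page with no old signature counts as changed.
    static boolean hasChanged(byte[] oldSignature, byte[] newSignature) {
        return oldSignature == null || !Arrays.equals(oldSignature, newSignature);
    }

    public static void main(String[] args) {
        byte[] stored = signature("<html>version 1</html>".getBytes(StandardCharsets.UTF_8));
        byte[] refetchedSame = signature("<html>version 1</html>".getBytes(StandardCharsets.UTF_8));
        byte[] refetchedEdited = signature("<html>version 2</html>".getBytes(StandardCharsets.UTF_8));

        System.out.println("same content changed?   " + hasChanged(stored, refetchedSame));
        System.out.println("edited content changed? " + hasChanged(stored, refetchedEdited));
    }
}
```

In a real setup the two byte arrays would come from the old crawldb entry and the newly fetched segment entry during the update, rather than being computed on the spot like this.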

----- Original Message -----
From: "Miao Liqiang NCS" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, May 12, 2008 9:21 AM
Subject: RE: Unable to tell if whether is any changes for the same webpage

Someone please response, many thanks.

-----Original Message-----
From: Miao Liqiang NCS
Sent: Monday, May 05, 2008 11:38 AM
To: [email protected]
Subject: RE: Unable to tell if whether is any changes for the same webpage

In which way the nutch informs there are changes? Am I able to know whether there are changes or not? If nutch knows there are changes internally, can I know that from outside through API or sonethging?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, May 05, 2008 11:32 AM
To: [email protected]
Subject: Re: Unable to tell if whether is any changes for the same webpage

It's part of Nutch, happens automatically.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Miao Liqiang NCS <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Sunday, May 4, 2008 8:33:49 PM
> Subject: RE: Unable to tell if whether is any changes for the same webpage
>
> Is this function provided in the nutch package, can I use it directly
> without programming the API?
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 02, 2008 8:12 PM
> To: [email protected]
> Subject: Re: Unable to tell if whether is any changes for the same webpage
>
> It uses MD5 of the page content and another method whose exact name I
> cannot remember now, but that is more forgiving of small textual
> changes. I think it also takes into consideration the Last-Modified
> HTTP response header, but I'd have to check that.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Miao Liqiang NCS
> > To: [email protected]
> > Sent: Friday, May 2, 2008 8:29:38 AM
> > Subject: RE: Unable to tell if whether is any changes for the same webpage
> >
> > Could you tell me how?
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Friday, May 02, 2008 2:12 PM
> > To: [email protected]
> > Subject: Re: Unable to tell if whether is any changes for the same webpage
> >
> > Hi,
> >
> > Yes, Nutch can detect when a page changed and when it didn't change.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: Miao Liqiang NCS
> > > To: [email protected]
> > > Sent: Friday, May 2, 2008 7:48:03 AM
> > > Subject: Unable to tell if whether is any changes for the same webpage
> > >
> > > Is Nutch able to tell whether there are any changes for the same
> > > webpage? For example, a webpage has been updated since last crawling, is
> > > nutch can tell this change of the webpage when there is a recrawling?
