Hi there,

I found an interesting field in the Nutch index directory
called "digest". It seems to be a hashed signature of a
fetched page's content. Is that true?

I tested my guess by checking the same page in two
different crawl rounds. The value of this field is
the same in both segments.

Essentially, I plan to check whether a page I am
crawling has been updated. If it has not changed since
the last crawl, I won't re-index it in my search
engine. To do this, I will compare the "digest" fields
of two fetches of the same URL.
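
To make the plan concrete, here is a rough sketch of the comparison I have
in mind. It does not use any Nutch API. It assumes the digest is an MD5
hash of the raw page content, which is only my guess. The class
DigestCheck, the in-memory map, and the needsReindex method are all
hypothetical names of my own.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
class DigestCheck {
    // URL -> digest seen in the previous crawl round (in-memory stand-in
    // for whatever store would really hold the old digests).
    private final Map<String, String> previousDigests = new HashMap<>();

    // Assumption: the "digest" is an MD5 hash of the content, hex-encoded.
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Records the new digest and returns true if the page is new or its
    // digest differs from the one recorded in the previous round.
    public boolean needsReindex(String url, String newDigest) {
        String oldDigest = previousDigests.put(url, newDigest);
        return !newDigest.equals(oldDigest);
    }

    public static void main(String[] args) {
        DigestCheck check = new DigestCheck();
        String url = "http://example.com/";
        String digest = md5Hex("page body".getBytes(StandardCharsets.UTF_8));
        System.out.println(check.needsReindex(url, digest)); // first fetch
        System.out.println(check.needsReindex(url, digest)); // unchanged content
    }
}
```

One caveat with this approach: a hash of the whole content changes on any
byte difference, so a page with a timestamp or rotating ad would look
updated on every crawl.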

Is this the right approach? Does Nutch provide an API
call to check whether a particular web page has been
updated?

Thanks,

Michael
