Feng (Michael) Ji
Wed, 03 Aug 2005 20:30:48 -0700
Hi there,
I found an interesting field in the Nutch index directory
called "digest". It seems to be a hashed signature of a
fetched page's content. Is that true?
I checked my guess by looking at the same page in two
different crawl rounds. The value of this field is
the same in both segments.
Essentially, I plan to check whether a page I am
crawling has been updated. If there is no change (meaning
no update yet), I won't index the page in my search
engine. To do this, I will compare the "digest" fields
of the two fetched copies of the same URL.
Is this the right approach? Does Nutch provide an API
call to check whether a particular web page has been
updated?
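To make the idea concrete, here is a rough sketch of the comparison I have in mind. It does not use any Nutch API. It assumes the "digest" is an MD5 hash of the raw page content (that is only my guess), and the class and method names (DigestCompare, digest, hasChanged) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
class DigestCompare {

    // Returns the hex-encoded MD5 hash of the page bytes.
    // Assumption: this approximates what the "digest" field stores.
    static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // A page counts as updated when the digests of the two fetches differ.
    static boolean hasChanged(String oldDigest, String newDigest) {
        return !oldDigest.equals(newDigest);
    }

    public static void main(String[] args) {
        byte[] firstFetch = "<html>same content</html>".getBytes(StandardCharsets.UTF_8);
        byte[] secondFetch = "<html>same content</html>".getBytes(StandardCharsets.UTF_8);

        String oldDigest = digest(firstFetch);
        String newDigest = digest(secondFetch);

        if (hasChanged(oldDigest, newDigest)) {
            System.out.println("page changed, re-index it");
        } else {
            System.out.println("page unchanged, skip indexing");
        }
    }
}
```

One thing I am unsure about: a hash of the raw content changes on any byte difference, so a page with a rotating ad or a timestamp would always look updated with this kind of check.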
Thanks,
Michael