Hi there,
I found an interesting field in the Nutch index directory
called "digest". It seems to be a hashed signature of a
fetched page's content. Is that true?
I checked my guess by looking at the same page in two
different crawl rounds. The value of this field is
the same in both segments.
Essentially, I plan to check whether a page I am crawling
has been updated. If there is no change (meaning no
update yet), I won't index the page into my search
engine. To achieve this, I will compare the "digest"
fields of the two fetched copies of the same URL.
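To make the idea concrete, here is a rough sketch of the comparison I have in mind. It does not use any Nutch API: the class DigestCheck, its methods, and the in-memory map are all hypothetical names of my own, and I am only assuming that the digest is a hash over the raw content bytes (MD5 is used here purely for illustration).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: remembers one digest per URL
// and reports whether newly fetched content differs from the last round.
class DigestCheck {
    // Digest seen in the previous crawl round, keyed by URL.
    private final Map<String, String> previousDigests = new HashMap<>();

    // Hash the raw content bytes and return the result as lowercase hex.
    // MD5 is an assumption; any stable content hash would work the same way.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the URL is new or its content digest has changed,
    // and records the new digest for the next round.
    boolean needsReindex(String url, byte[] content) {
        String current = md5Hex(content);
        String previous = previousDigests.put(url, current);
        return !current.equals(previous);
    }
}
```

A page seen for the first time is always reported as needing indexing, since there is no earlier digest to compare against. Note that an exact hash like this flags any byte-level difference, so a page with a changing timestamp or ad block would count as updated every round.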
Is this the right approach? Does Nutch provide an API
call to check the update status of a particular web
page?
Thanks,
Michael
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers