Track changes to pages between crawls?

Karen Church Thu, 16 Jun 2005 07:29:37 -0700
Hi all,
 
I'm interested in tracking changes to pages between crawls.  I want to be able 
to log new pages added since the last crawl, updates to existing pages as well 
as any pages that have been removed.  I think I can determine if a page has 
been updated by comparing the MD5 hash of the two pages.
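
To make the idea concrete, here is a rough sketch of the comparison in plain Python rather than against the Nutch API. It assumes each crawl can be reduced to a mapping from URL to the raw bytes fetched for that URL; the function name `diff_crawls` and that data layout are hypothetical, chosen only for illustration.

```python
import hashlib


def diff_crawls(old_pages, new_pages):
    """Compare two crawls given as {url: raw_bytes} dicts (assumed layout).

    Returns three sorted lists of URLs: added, removed, updated.
    """
    # Reduce every page to its MD5 digest so the comparison is cheap.
    old_hashes = {url: hashlib.md5(body).hexdigest() for url, body in old_pages.items()}
    new_hashes = {url: hashlib.md5(body).hexdigest() for url, body in new_pages.items()}

    # URLs present only in the new crawl were added; only in the old, removed.
    added = sorted(new_hashes.keys() - old_hashes.keys())
    removed = sorted(old_hashes.keys() - new_hashes.keys())

    # URLs in both crawls count as updated when their digests differ.
    common = old_hashes.keys() & new_hashes.keys()
    updated = sorted(url for url in common if old_hashes[url] != new_hashes[url])
    return added, removed, updated
```

Only the digests from the previous crawl need to be kept around for this to work, not the pages themselves.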
 
In looking at the code, it appears that it's the 'Content' of the Page that is 
hashed, so if I compare two pages using this technique, I'm essentially 
comparing the 'Content' of the two pages.  My question is: does this mean that 
other changes to a page, for example changes to tags, meta-data, etc., cannot 
be tracked?
 
What exactly constitutes 'Content' within Nutch?  I understand that this 
depends on what parser you're using, but if we're talking about HTML pages, is 
the 'Content' of the page the text between the <body> tags or all of the text 
between all of the tags?
 
I think I could compare other page attributes like the title or meta-data of 
the page using the ParseData class but I'm a little apprehensive that I'll 
still be missing out on other changes to the page. 
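
As a rough illustration of comparing attributes separately, the sketch below hashes the raw markup, the title, the meta tags and the visible text of a page independently, so that a markup-only change shows up in the raw hash even when the text hash is unchanged. It uses Python's standard `html.parser` rather than Nutch's parsers or the ParseData class; `FieldExtractor`, `fingerprints` and `changed_fields` are hypothetical names, and "text" here simply means all character data outside the <title> element.

```python
import hashlib
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the title, meta-tag attributes and remaining text of a page."""

    def __init__(self):
        super().__init__()
        self.title, self.meta, self.text = [], [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Sort attributes so their order in the markup does not matter.
            self.meta.append(sorted((k, v or "") for k, v in attrs))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        (self.title if self._in_title else self.text).append(data)


def _md5(value):
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def fingerprints(html):
    """Return one MD5 digest per tracked field of an HTML string."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        "raw": _md5(html),  # any change at all, including tags
        "title": _md5(" ".join("".join(parser.title).split())),
        "meta": _md5(repr(sorted(parser.meta))),
        "text": _md5(" ".join(" ".join(parser.text).split())),  # whitespace-normalized
    }


def changed_fields(old_html, new_html):
    """Names of the fields whose digests differ between two versions."""
    old, new = fingerprints(old_html), fingerprints(new_html)
    return sorted(name for name in old if old[name] != new[name])
```

With something like this, a page whose <p> tags were swapped for <div> tags would report only "raw" as changed, while an edited keywords meta tag would report "meta" and "raw".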
 
Any ideas/suggestions? The answers to these questions are probably quite 
obvious but I'd really appreciate any help you can provide me with : )
 
Thanks,
Karen