Hi all,
I'm interested in tracking changes to pages between crawls. I want to be able
to log new pages added since the last crawl, updates to existing pages, and
any pages that have been removed. I think I can determine whether a page has
been updated by comparing the MD5 hashes of the two versions of the page.
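To make the idea concrete, here is a rough sketch of the comparison I have in
mind. It is plain Java and does not use any Nutch classes: CrawlDiff, md5Hex
and diff are names I made up for illustration, and I'm assuming each crawl can
be reduced to a map from URL to the MD5 hex digest of the page.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: compares two crawl snapshots,
// each represented as a map from URL to the MD5 hex digest of the page.
class CrawlDiff {

    enum Change { ADDED, UPDATED, REMOVED }

    // Returns the MD5 digest of the text as a lowercase hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Classifies each URL that differs between the previous and current crawl.
    // URLs whose hash is unchanged are left out of the result.
    static Map<String, Change> diff(Map<String, String> previous,
                                    Map<String, String> current) {
        Map<String, Change> changes = new TreeMap<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldHash = previous.get(entry.getKey());
            if (oldHash == null) {
                changes.put(entry.getKey(), Change.ADDED);
            } else if (!oldHash.equals(entry.getValue())) {
                changes.put(entry.getKey(), Change.UPDATED);
            }
        }
        for (String url : previous.keySet()) {
            if (!current.containsKey(url)) {
                changes.put(url, Change.REMOVED);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Made-up example data standing in for two crawls.
        Map<String, String> previous = new HashMap<>();
        previous.put("http://example.com/a", md5Hex("old text"));
        previous.put("http://example.com/b", md5Hex("same text"));
        previous.put("http://example.com/c", md5Hex("gone soon"));

        Map<String, String> current = new HashMap<>();
        current.put("http://example.com/a", md5Hex("new text"));
        current.put("http://example.com/b", md5Hex("same text"));
        current.put("http://example.com/d", md5Hex("brand new"));

        System.out.println(diff(previous, current));
    }
}
```

The obvious limitation is that this only sees whatever went into the hash,
which is what the rest of my question is about.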
Looking at the code, it appears that it's the 'Content' of the Page that is
hashed, so if I compare two pages using this technique, I'm essentially
comparing the 'Content' of the two pages. Does this mean that other changes
to a page (for example, changes to tags, meta-data, etc.) cannot be tracked?
What exactly constitutes 'Content' within Nutch? I understand that this
depends on which parser you're using, but if we're talking about HTML pages,
is the 'Content' of the page the text between the <body> tags, or all of the
text between all of the tags?
I think I could compare other page attributes like the title or meta-data of
the page using the ParseData class but I'm a little apprehensive that I'll
still be missing out on other changes to the page.
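What I'm picturing is roughly the sketch below: hash each part of the page
separately so I can at least tell which part changed. Again this is plain
Java with made-up names (PageFingerprint, changedParts); I'm assuming the
text, title and meta-data have already been pulled out of the parse output
as plain strings, and I haven't shown that step.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, not part of Nutch: keeps one MD5 hash per page part
// so two versions of a page can be compared part by part.
class PageFingerprint {
    final String textHash;
    final String titleHash;
    final String metaHash;

    PageFingerprint(String text, String title, Map<String, String> meta) {
        this.textHash = md5Hex(text);
        this.titleHash = md5Hex(title);
        // Sort the keys so the hash does not depend on map iteration order.
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(meta).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        this.metaHash = md5Hex(sb.toString());
    }

    // Returns the names of the parts whose hashes differ from the other version.
    List<String> changedParts(PageFingerprint other) {
        List<String> changed = new ArrayList<>();
        if (!textHash.equals(other.textHash)) changed.add("text");
        if (!titleHash.equals(other.titleHash)) changed.add("title");
        if (!metaHash.equals(other.metaHash)) changed.add("meta");
        return changed;
    }

    // Returns the MD5 digest of the value as a 32-character hex string.
    static String md5Hex(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }
}
```

Anything that isn't fed into one of the three hashes (markup changes, for
instance) would still go unnoticed, which is the part that worries me.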
Any ideas/suggestions? The answers to these questions are probably quite
obvious but I'd really appreciate any help you can provide me with : )
Thanks,
Karen