Karen Church wrote:
Hi all,

I'm interested in tracking changes to pages between crawls.  I want
to be able to log new pages added since the last crawl, updates to
existing pages as well as any pages that have been removed.  I think
I can determine if a page has been updated by comparing the MD5 hash
of the two pages.

In looking at the code, it appears that it's the 'Content' of the
Page that is hashed - so if I want to compare two pages using this
technique, I'm essentially comparing the 'Content' of the two pages.
My question is: does this mean that additional changes to a page
cannot be tracked, for example changes to tags, meta-data, etc.?

They are all tracked, because any change to the content changes the MD5. They are not tracked separately, though: the hash tells you that something changed, not which part of the page changed.
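
To illustrate the point, here is a minimal sketch using only the JDK's MessageDigest. The class name ContentDigest and the method md5Hex are made up for this example and are not part of Nutch; the sketch only assumes that you have the raw byte[] of each fetched page in hand.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not Nutch API): hex MD5 of a raw page payload. */
public class ContentDigest {

    /** Returns the 32-character lowercase hex MD5 of the given bytes. */
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            // BigInteger(1, ...) treats the digest as unsigned; %032x keeps leading zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Two payloads that differ only in a meta tag produce different digests, so the change is detected, but the digest alone cannot say that it was the meta tag that changed.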

What exactly constitutes 'Content' within Nutch?  I understand that

I assume you are asking about the semantics of byte[] Content.getContent(), right? This content is the protocol payload. In the case of HTTP, this is the response body as raw bytes; in the case of FTP, it is the file content; and so on.

I think I could compare other page attributes like the title or
meta-data of the page using the ParseData class but I'm a little
apprehensive that I'll still be missing out on other changes to the
page.

Hence the MD5 checksum, which is calculated over the whole byte[] content.
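
With one whole-content checksum per URL from each crawl, the three cases in the original question (new, updated, removed) can be derived by comparing two maps. The following is only a sketch under that assumption: CrawlDiff is a hypothetical class, not part of Nutch, and how you collect the URL-to-checksum maps from each crawl is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not Nutch API): diffs two URL -> checksum maps. */
public class CrawlDiff {
    public final Set<String> added = new TreeSet<>();
    public final Set<String> updated = new TreeSet<>();
    public final Set<String> removed = new TreeSet<>();

    /** Compares the previous crawl's checksums against the current crawl's. */
    public static CrawlDiff between(Map<String, String> previous,
                                    Map<String, String> current) {
        CrawlDiff diff = new CrawlDiff();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String oldHash = previous.get(e.getKey());
            if (oldHash == null) {
                diff.added.add(e.getKey());      // URL not seen in the last crawl
            } else if (!oldHash.equals(e.getValue())) {
                diff.updated.add(e.getKey());    // same URL, different checksum
            }
        }
        for (String url : previous.keySet()) {
            if (!current.containsKey(url)) {
                diff.removed.add(url);           // URL gone from the current crawl
            }
        }
        return diff;
    }
}
```

Note that "removed" here only means the URL is absent from the current map; a page that merely failed to fetch would look the same unless fetch status is checked separately.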

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
