Karen Church wrote:
Hi all,
I'm interested in tracking changes to pages between crawls. I want
to be able to log new pages added since the last crawl, updates to
existing pages as well as any pages that have been removed. I think
I can determine if a page has been updated by comparing the MD5 hash
of the two pages.
In looking at the code, it appears that it's the 'Content' of the
Page that is hashed - so if I want to compare two pages using this
technique, I'm essentially comparing the 'Content' of the two pages.
My question is: does this mean that additional changes to a page
cannot be tracked? For example, changes to tags, meta-data, etc.?
They are all tracked, because any change in the content changes the MD5.
But they are not tracked separately: the hash tells you that something
changed, not what changed.
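As a minimal sketch of why a change to a tag or to meta-data shows up in the
hash: the digest is taken over the raw bytes, so a page whose visible text is
identical still hashes differently. This is plain Java, not Nutch code; the
class name ContentDigestDemo and the helper md5Hex are made up for
illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentDigestDemo {

    // Hypothetical helper: hex-encoded MD5 of the raw payload bytes.
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard JDK algorithm, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Same visible text ("Hello"), only the meta tag differs.
        String before = "<html><head><meta name=\"keywords\" content=\"a\">"
                + "</head><body>Hello</body></html>";
        String after = "<html><head><meta name=\"keywords\" content=\"b\">"
                + "</head><body>Hello</body></html>";

        String h1 = md5Hex(before.getBytes(StandardCharsets.UTF_8));
        String h2 = md5Hex(after.getBytes(StandardCharsets.UTF_8));

        System.out.println(h1);
        System.out.println(h2);
        System.out.println("changed: " + !h1.equals(h2));
    }
}
```

The two digests differ, but nothing in them says that it was the meta tag
that changed; for that you would have to compare the parsed fields yourself.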
What exactly constitutes 'Content' within Nutch? I understand that
I assume you are asking about the semantics of byte[] Content.getContent(),
right? This content is the protocol payload. In the case of HTTP, this is
the raw response body; in the case of FTP, it is the file content; and
so on.
I think I could compare other page attributes like the title or
meta-data of the page using the ParseData class but I'm a little
apprehensive that I'll still be missing out on other changes to the
page.
Hence the MD5 checksum, which is calculated over the whole byte[] content.
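For the original goal (logging pages that are new, updated or removed since
the last crawl), a rough sketch of the comparison could look like the
following. It assumes you have already extracted a URL -> checksum map for
each crawl; how you get those maps out of Nutch is not shown, and the class
CrawlDiff and its methods are hypothetical names, not part of Nutch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CrawlDiff {

    // URLs present in the current crawl but not in the previous one.
    public static Set<String> added(Map<String, String> previous,
                                    Map<String, String> current) {
        Set<String> result = new TreeSet<>(current.keySet());
        result.removeAll(previous.keySet());
        return result;
    }

    // URLs present in the previous crawl but missing from the current one.
    public static Set<String> removed(Map<String, String> previous,
                                      Map<String, String> current) {
        return added(current, previous);
    }

    // URLs present in both crawls whose checksum differs.
    public static Set<String> updated(Map<String, String> previous,
                                      Map<String, String> current) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String old = previous.get(e.getKey());
            if (old != null && !old.equals(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder checksums, for illustration only.
        Map<String, String> previous = new HashMap<>();
        previous.put("http://example.com/a", "hash-a1");
        previous.put("http://example.com/b", "hash-b1");

        Map<String, String> current = new HashMap<>();
        current.put("http://example.com/a", "hash-a2"); // content changed
        current.put("http://example.com/c", "hash-c1"); // new page

        System.out.println("added:   " + added(previous, current));
        System.out.println("updated: " + updated(previous, current));
        System.out.println("removed: " + removed(previous, current));
    }
}
```

Note that a URL missing from the current crawl is only a hint that the page
was removed; it may simply not have been fetched this time.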
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com