Andrzej Bialecki wrote:
For efficiency reasons, most of this information is stored and passed to processing jobs inside instances of CrawlDatum. The key step of DB update does not use any other parts of segments (such as Content, ParseData or ParseText), which prevents easy access to other page metadata. For now, I added both the signature and the modifiedTime to CrawlDatum as separate attributes, but I'm considering putting them (and any other values that users might want to add to CrawlDB) into a Properties attribute.

Yes, I agree that CrawlDatum should have extensible properties. If these are empty, then no Properties instance should be allocated.
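
Roughly, the lazy allocation could look like the sketch below. This is only an illustration: DatumSketch and its method names are hypothetical stand-ins, not the actual CrawlDatum code, and serialization is left out entirely.

```java
import java.util.Properties;

// Hypothetical stand-in for CrawlDatum; not the actual Nutch class.
class DatumSketch {
    // Stays null until the first property is set, so a datum without
    // extra metadata never allocates a Properties instance.
    private Properties props;

    void setProperty(String key, String value) {
        if (props == null) {
            props = new Properties();
        }
        props.setProperty(key, value);
    }

    String getProperty(String key) {
        // No allocation on reads: an absent map simply means "no value".
        return props == null ? null : props.getProperty(key);
    }

    boolean hasProperties() {
        return props != null && !props.isEmpty();
    }

    public static void main(String[] args) {
        DatumSketch datum = new DatumSketch();
        System.out.println(datum.hasProperties());
        datum.setProperty("signature", "abc123");
        System.out.println(datum.getProperty("signature"));
    }
}
```

The point of the null check is that the common case (no extra values) costs one reference field per datum and nothing more.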

This is great stuff.  I look forward to getting it committed!

Doug
