In some cases, though, focused crawling may require extra data to be stored that is not useful for whole-web crawling, for example a URL's parent URL, its seed URL, and its depth (essential for crawl scopes).

Sounds like metadata for a page. :)
Some time ago I submitted a patch to the issue tracker. We use this metadata in a project here to decide whether a page should be crawled, and to pass metadata from a parent page to a child page.
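
To make the idea concrete, here is a minimal sketch of per-page metadata that is inherited from a parent page and then used for a crawl-scope decision. It is not the patch mentioned above and not any crawler's real API: the names PageMeta, child_meta, should_crawl, the "topic" key, and the example.org URLs are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PageMeta:
    """Hypothetical per-page metadata record (not a real crawler class)."""
    url: str
    seed_url: str
    parent_url: Optional[str] = None
    depth: int = 0
    extra: Dict[str, str] = field(default_factory=dict)


def child_meta(parent: PageMeta, child_url: str) -> PageMeta:
    """Build metadata for an outlink, inheriting from the parent page."""
    return PageMeta(
        url=child_url,
        seed_url=parent.seed_url,   # the seed is carried through unchanged
        parent_url=parent.url,
        depth=parent.depth + 1,     # one link further from the seed
        extra=dict(parent.extra),   # copy, so the child can diverge later
    )


def should_crawl(meta: PageMeta, max_depth: int) -> bool:
    """Assumed scope rule: stay within max_depth links of the seed."""
    return meta.depth <= max_depth


seed = PageMeta(url="http://example.org/", seed_url="http://example.org/",
                extra={"topic": "news"})
child = child_meta(seed, "http://example.org/a.html")
grandchild = child_meta(child, "http://example.org/a/b.html")

for meta in (seed, child, grandchild):
    print(meta.url, meta.depth, should_crawl(meta, max_depth=1))
```

With max_depth=1 the seed and its direct child are in scope and the grandchild is not. Any other rule (same host as the seed, a required metadata key, and so on) could be put in should_crawl in the same way.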

I still believe flexible page metadata would be a big help in many cases. I also believe that using map reduce to 'merge in' metadata, as Doug suggests, isn't that powerful, since it requires the page and the metadatum to have an identical key.
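
One reading of the key limitation, as a rough sketch: a reduce-side merge can only attach a metadatum to a page record that arrives under exactly the same key. The function merge_by_key, the record layout, and the example.org URLs below are hypothetical and only stand in for a real map reduce job.

```python
from collections import defaultdict


def merge_by_key(pages, metadata):
    """Toy reduce-side join: group page records and metadata by URL key."""
    grouped = defaultdict(lambda: {"page": None, "meta": {}})
    for url, page in pages:
        grouped[url]["page"] = page
    for url, meta in metadata:
        grouped[url]["meta"].update(meta)
    # Keep only keys that actually have a page record.
    return {url: rec for url, rec in grouped.items() if rec["page"] is not None}


pages = [
    ("http://example.org/", "seed page"),
    ("http://example.org/a.html", "child page"),
]
# The metadatum is keyed by the parent URL only.
metadata = [("http://example.org/", {"topic": "news"})]

merged = merge_by_key(pages, metadata)
for url, rec in merged.items():
    print(url, rec["meta"])
```

In this sketch the child page ends up with empty metadata, because nothing was emitted under its own URL. To hand the parent's metadata to the child, a job would first have to re-emit it keyed by the child URL.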
Just my 2 cents.
Greetings,
Stefan

