I have a list of businesses, each with a URL and some extra information, that
I need to crawl. I have used Nutch to crawl this list without following
external links, and it seems to be working well. However, I need to relate the
crawled text (all pages and sub-pages within each original domain) back to the
original business record in the database. That means adding an ID field that
references the business ID to each document in the generated index.

How can I do this with Nutch? Can it be done at inject/fetch time, or will I
have to try to match URLs to IDs after the index is generated?
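
For the fallback option, this is roughly the matching I have in mind. It is
only a minimal sketch in Python, not anything Nutch provides: the function
names and the sample records are made up for illustration, and it assumes each
business has exactly one seed URL and that no two businesses share a host.

```python
from urllib.parse import urlsplit


def host_key(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_host_index(businesses):
    """Map the host of each seed URL to its business ID.

    `businesses` is an iterable of (business_id, seed_url) pairs.
    """
    return {host_key(url): business_id for business_id, url in businesses}


def business_id_for(page_url, host_index):
    """Look up the business ID for a crawled page, or None if unknown."""
    return host_index.get(host_key(page_url))


# Hypothetical records standing in for the business table.
businesses = [
    (101, "http://www.example.com/"),
    (102, "http://shop.example.org/home"),
]
host_index = build_host_index(businesses)

print(business_id_for("http://example.com/about/team.html", host_index))
print(business_id_for("http://unknown.example.net/", host_index))
```

This only matches on the exact host (ignoring "www."), so pages served from a
different subdomain of the same business would not be matched, which is part
of why I would prefer to carry the ID through the crawl itself.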

Cheers,
Vince
