I have a list of businesses, each with a URL and some extra information, that I need to crawl. I have used Nutch to crawl this list without following external links, and it seems to be working well. However, I need to relate the crawled text (all pages and sub-pages within the original domain) to the original business record in the database, so I need to add an ID field to each document in the generated index that references the business ID.
How can I do this with Nutch? Can it be done at inject/fetch time, or will I have to match URLs to IDs after the index is generated? Cheers, Vince
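
P.S. To make the fallback concrete, here is a rough sketch of the post-index matching I have in mind. It is plain Python and does not use Nutch at all. The function names, the sample IDs and the example URLs are made up for illustration, and it assumes each business maps to exactly one host, with a leading "www." ignored.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_host_index(businesses):
    """Map each business's host to its ID.

    `businesses` is an iterable of (business_id, seed_url) pairs
    (a hypothetical stand-in for the rows in my database).
    """
    return {host_of(url): business_id for business_id, url in businesses}


def business_id_for(page_url, host_index):
    """Look up the business ID for a crawled page, or None if the host is unknown."""
    return host_index.get(host_of(page_url))


# Made-up sample data standing in for the real business table.
businesses = [
    (101, "http://www.example.com/"),
    (102, "https://shop.example.org/about"),
]
host_index = build_host_index(businesses)

# Each crawled page is matched back to a business through its host.
crawled_pages = [
    "http://example.com/contact/team.html",
    "https://shop.example.org/products?id=3",
]
matches = {url: business_id_for(url, host_index) for url in crawled_pages}
```

This works only as long as no two businesses share a host, which is why I would prefer to carry the ID through the crawl itself if Nutch allows it.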
