Adding IDs to the index generated by Nutch

Vince Filby Fri, 10 Aug 2007 11:46:37 -0700

I have a list of businesses, URLs and extra information that I need to
crawl. I have used Nutch to crawl this list without following external links
and it seems to be working well, but I need to relate the crawled web text
data (including all pages and sub-pages within the original domain) to the
original business record in the database. I need to add an ID field to
each document in the generated index that references the business ID.


How can I do this with Nutch? Can it be done at inject/fetch time, or will I
have to try to match URLs to IDs after the index is generated?
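
To make the fallback option concrete, here is a minimal sketch of matching
URLs to IDs after the fact. It is not Nutch code. It assumes each business
has exactly one seed URL and that every crawled page stays on that seed's
host. The function names and the sample records are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_host(url):
    """Return the lower-cased host of a URL with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host


def build_host_index(businesses):
    """Map each seed URL's host to its business ID.

    `businesses` is an iterable of (business_id, seed_url) pairs, for
    example rows read from the business database (assumed layout).
    """
    index = {}
    for business_id, seed_url in businesses:
        index[normalize_host(seed_url)] = business_id
    return index


def business_id_for(url, host_index):
    """Look up the business ID for a crawled URL, or None if unknown."""
    return host_index.get(normalize_host(url))


# Hypothetical sample records, for illustration only.
SAMPLE_BUSINESSES = [
    (101, "http://www.example.com/"),
    (102, "http://shop.example.org/about"),
]
SAMPLE_INDEX = build_host_index(SAMPLE_BUSINESSES)
```

With the sample records, business_id_for("http://example.com/contact.html",
SAMPLE_INDEX) resolves to 101. This approach breaks down if two businesses
share a host, which is part of why I would prefer to carry the ID through
the crawl itself.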

Cheers,
Vince
