Not sure if this is the easiest solution, but you might want to have a look at:

        http://wiki.apache.org/nutch/WritingPluginExample-0.9

I have used it as template code to add fields to my index.
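
In case it is useful, here is a minimal sketch of the lookup part, i.e. going from a crawled page URL back to the business ID. It is plain Java and does not use any Nutch API: the class name BusinessIdLookup and its methods are made up for illustration, and it assumes each business maps to exactly one host name (with or without a leading "www."). The ID it returns is the value you would then add as an extra field, either inside a plugin like the one on the wiki page or in a pass after indexing.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): maps a page URL to a business ID by host name. */
final class BusinessIdLookup {
    // Normalized host name -> business ID from the database.
    private final Map<String, String> idByHost = new HashMap<String, String>();

    /** Registers a business's seed URL (as it appears in the inject list) with its ID. */
    void register(String seedUrl, String businessId) {
        String host = hostOf(seedUrl);
        if (host != null) {
            idByHost.put(host, businessId);
        }
    }

    /** Returns the business ID for a crawled page URL, or null if the host is unknown. */
    String idFor(String pageUrl) {
        String host = hostOf(pageUrl);
        return host == null ? null : idByHost.get(host);
    }

    /** Extracts a lower-cased host name without a leading "www.", or null if unparsable. */
    private static String hostOf(String url) {
        try {
            String host = new URI(url.trim()).getHost();
            if (host == null) {
                return null;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Since the crawl stays within each original domain, matching on host name should cover the sub-pages too. If several businesses share a host, the key would need to include part of the path instead.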

Hope this helps,

Jasper

On Aug 10, 2007, at 11:46 AM, Vince Filby wrote:

I have a list of businesses, urls and extra information that I need to
crawl. I have used Nutch to crawl this list without following external
links and it seems to be working well, but I need to relate the crawled
web text data (including all pages and sub-pages within the original
domain) with the original business record in the database. I need to add
an ID field to each document in the generated index that references the
business ID.

How can I do this with Nutch? Can it be done at inject/fetch time or will I
have to try to match urls to IDs after the index is generated?

Cheers,
Vince
