Hello everyone,

One of the requirements for a site I am working on is that some information lives in a DB and some in a Nutch index.

The Nutch index obviously contains the indexed URLs, etc. However, I also have a DB that contains the URLs plus a bunch of other information about each URL, for example comments, ranking, etc.

What is the best way to update the DB based on what the spider finds? One approach I was thinking of was to not update the DB until somebody actually requests the data, and then add the URLs to the DB at that time. It feels kind of backwards, though, especially if there is other data that needs to be collected from external sites, etc.
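
To make the "add it on first request" idea concrete, here is a rough sketch of what I have in mind. It uses Python and an in-memory SQLite table purely for illustration; the `urls` table, its columns, and the `get_or_create_url_row` helper are all made up for this example and have nothing to do with Nutch itself.

```python
import sqlite3


def get_or_create_url_row(conn, url):
    """Return the DB row for `url`, inserting a bare row on first request."""
    # INSERT OR IGNORE does nothing if the URL is already present,
    # so repeated requests never create duplicates.
    conn.execute(
        "INSERT OR IGNORE INTO urls (url, ranking) VALUES (?, 0)", (url,)
    )
    conn.commit()
    return conn.execute(
        "SELECT url, ranking FROM urls WHERE url = ?", (url,)
    ).fetchone()


# Hypothetical schema: one row per URL plus the extra data kept outside the index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, ranking INTEGER)")

# The first request for a URL found in the index creates its DB row.
row = get_or_create_url_row(conn, "http://example.com/")
```

The drawback is the one mentioned above: the row only exists once someone asks for it, so any data that has to be pulled from external sites would still be missing at that point.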

My second question: I would like to tag each URL that was fetched and indexed as a particular type. For example, RSS feeds would get tagged one way and HTML pages another, so that when I perform a query against the Nutch index I can return only RSS or only Web results. Is the best way to do this some sort of tagging with a plugin? Or can Nutch do that "out of the box"?
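
The tagging logic itself would be something like the sketch below. This is plain Python to show the idea, not Nutch plugin code; the tag names ("rss", "web", "other"), the `FEED_TYPES` set, and the `classify` function are my own assumptions.

```python
# Assumed MIME types for feeds; real feeds are sometimes served as
# text/xml, which this simple check would not catch.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def classify(content_type):
    """Map a Content-Type header value to a coarse tag for the index."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in FEED_TYPES:
        return "rss"
    if mime in HTML_TYPES:
        return "web"
    return "other"
```

The tag would then be stored as an extra field on each indexed document, so a query could be restricted to one value of that field.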

Thanks in advance for all of the advice.

-John
