Additional URL Content

John Martyniak Thu, 30 Oct 2008 04:27:19 -0700

Hello everyone,

Part of the requirements for a site that I am working on is that I have some information in a DB and some in a Nutch index.

The Nutch index obviously contains the indexed URLs, etc. However, I also have a DB that contains the URLs and a bunch of other information about each URL, for example comments, ranking, etc.

What is the best way to update the DB based on what the spider finds? One approach I was thinking of was to not update the DB until somebody actually requests the data, and then add the URLs to the DB at that time. That feels kind of backwards, especially if there is other data that needs to be collected from external sites, etc.
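
To make that first approach concrete, here is a rough sketch of what I mean by adding a URL to the DB only when it is first requested. It is plain Python with an in-memory SQLite database standing in for the real DB; the table name `url_info`, its columns, and the helper names are all made up for illustration, and nothing here touches Nutch itself.

```python
import sqlite3


def open_db():
    # In-memory database standing in for the real site DB (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE url_info ("
        " url TEXT PRIMARY KEY,"
        " comments TEXT DEFAULT '',"
        " ranking INTEGER DEFAULT 0)"
    )
    return conn


def get_or_create(conn, url):
    # Insert a row with default values only if the URL is not there yet;
    # an existing row (and its comments/ranking) is left untouched.
    conn.execute("INSERT OR IGNORE INTO url_info (url) VALUES (?)", (url,))
    conn.commit()
    # Return the row as (url, comments, ranking).
    return conn.execute(
        "SELECT url, comments, ranking FROM url_info WHERE url = ?", (url,)
    ).fetchone()
```

The request handler would call `get_or_create` for each URL that comes back from a search, so the DB only ever grows with URLs that somebody has actually asked for.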

The second question: I would like to tag each URL that was fetched and indexed as a particular type. For example, RSS feeds would get tagged one way and HTML another way, so that when I perform a query against the Nutch index I can return only RSS or only Web results. Is the best way to do this some sort of tagging with a plugin? Or can Nutch do that "out of the box"?
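
As an illustration of the kind of tagging I have in mind, here is a small sketch that maps a document's content type to a tag and then filters on that tag. The tag names (`rss`, `web`, `other`), the content-type sets, and the `content_type` key are my own assumptions, and this is plain Python rather than Nutch plugin code.

```python
# Content types treated as feeds and as web pages (assumed, not exhaustive).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def tag_for(content_type):
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in FEED_TYPES:
        return "rss"
    if mime in HTML_TYPES:
        return "web"
    return "other"


def filter_by_tag(docs, tag):
    # docs is a list of dicts, each with a "content_type" key (hypothetical shape).
    return [d for d in docs if tag_for(d["content_type"]) == tag]
```

In the real setup the tag would presumably be stored as a field at index time, so that the query can restrict on it directly instead of filtering results afterwards.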


Thanks in advance for all of the advice.

-John
