Hello everyone,
One of the requirements for a site that I am working on is that some
information lives in a DB and some in a Nutch index.
The Nutch index obviously contains the indexed URLs, etc. However, I
also have a DB that contains the URLs and a bunch of other information
about each URL, for example comments, ranking, etc.
What is the best way to update the DB based on what the spider finds?
One approach I was thinking of was to not update the DB until somebody
actually requests the data, and then add the URLs to the DB at that
time. That seems kind of backwards, especially if there is other data
that needs to be collected from external sites, etc.
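
To make that first approach concrete, here is a rough sketch of what I
mean by adding the URL only when it is first requested. It uses Python
with an in-memory SQLite DB purely for illustration; the table name
url_info, its columns, and the function get_url_record are made up and
are not part of Nutch or of my real schema.

```python
import sqlite3

# Hypothetical schema: one row per URL plus the extra data kept outside Nutch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE url_info ("
    " url TEXT PRIMARY KEY,"
    " comments TEXT DEFAULT '',"
    " ranking INTEGER DEFAULT 0)"
)


def get_url_record(conn, url):
    """Return the DB row for url, creating an empty row on first request."""
    # INSERT OR IGNORE leaves an existing row (and its data) untouched.
    conn.execute("INSERT OR IGNORE INTO url_info (url) VALUES (?)", (url,))
    conn.commit()
    return conn.execute(
        "SELECT url, comments, ranking FROM url_info WHERE url = ?", (url,)
    ).fetchone()
```

The catch is that a row created this way starts out empty, so anything
that has to be gathered from external sites would still be missing the
first time somebody asks for it.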
My second question: I would like to tag each URL that was fetched and
indexed as a particular type. For example, RSS feeds would get tagged
one way and HTML another, so that when I perform a query against the
Nutch index I can return only RSS or only Web results. Is the best way
to do this some sort of tagging with a plugin? Or can Nutch do that
"out of the box"?
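
To show the kind of tagging I have in mind, here is a small sketch
that maps a fetched document's Content-Type header to a tag. This is
plain Python, not Nutch plugin code; the function name doc_type, the
tag values "rss", "web" and "other", and the two MIME type sets are my
own assumptions.

```python
# Assumed MIME type groups; a real crawl would likely need more entries.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def doc_type(content_type):
    """Map a Content-Type header value to a coarse tag for filtering."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in FEED_TYPES:
        return "rss"
    if mime in HTML_TYPES:
        return "web"
    # Feeds served as generic XML (e.g. text/xml) end up here; telling
    # those apart would require looking at the document body.
    return "other"
```

The idea would be to store that tag as an extra field on each indexed
document and then restrict queries to one value of it.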
Thanks in advance for all of the advice.
-John