Elena wrote:
Hello everyone,

I am using Nutch with the Solr plugin, and I am having a problem indexing
redirected URLs. While Solr generates its fields just fine, as if they
belonged to the redirected url, Nutch leaves the summary field empty. It
seems as if Nutch tries to generate the summary of the original url and then
makes the query to Solr, which then follows the redirect and fills the rest
of the fields using the final url. But I am not quite sure of this.

It depends on what version of Nutch you are using. This was a problem with some older trunk versions. The problem is that Nutch has the concept of a representative URL for redirects. A redirect has an original URL and a redirected-to URL, and logic dictates which of those is stored as the URL and which is displayed on search results pages. Most of the problems caused by this mismatch have been fixed in recent patches, and the fixes should ship in a new 1.0 release in the next week or so.


I would like to know how Nutch generates summaries and why it
leaves them empty when redirecting. Perhaps there is a command to generate
one field in particular, after the indexing is done.

Summaries are generated at query time from the full text of the web page, which is stored in ParseText under segments. The org.apache.nutch.searcher.Summarizer plugins are what actually return the summary text. By default Nutch uses the summary-basic plugin.
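
To make the query-time behaviour concrete, here is a minimal Python sketch. It is not Nutch's code and not the summary-basic algorithm. The parse_text dictionary, the example URLs, and the summarize and summary_for functions are all hypothetical stand-ins. It assumes the stored text is looked up by URL, which would explain how a mismatch between the original and the redirected-to URL could leave the summary empty.

```python
# Hypothetical stand-in for the page text stored under segments,
# keyed here by the redirected-to URL only (an assumption for illustration).
parse_text = {
    "http://example.com/new": "Apache Nutch is an open source web crawler.",
}


def summarize(text, query_terms, context=40):
    """Return a short excerpt of `text` around the first matching query term."""
    if not text:
        return ""  # no stored text, so nothing to summarize
    lowered = text.lower()
    for term in query_terms:
        pos = lowered.find(term.lower())
        if pos != -1:
            start = max(0, pos - context)
            end = min(len(text), pos + len(term) + context)
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(text) else ""
            return prefix + text[start:end].strip() + suffix
    # No term matched: fall back to the beginning of the text.
    return text[: 2 * context].strip()


def summary_for(url, query_terms):
    """Look up the stored text for `url` and summarize it at query time."""
    return summarize(parse_text.get(url, ""), query_terms)
```

In this sketch, summary_for("http://example.com/new", ["nutch"]) returns an excerpt, while the same call with "http://example.com/old" returns an empty string because no text is stored under that key.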

Dennis

Thanks!
