Re: Problem generating summaries for redirected url´s

Dennis Kubes Tue, 25 Nov 2008 12:22:02 -0800


Elena wrote:

Hello everyone,

I am using Nutch with the Solr plugin, and I am having a problem indexing
redirected url´s. While Solr generates its fields just fine, as if they
belonged to the redirected url, Nutch leaves the summary field empty. It
seems as if Nutch tries to generate the summary of the original url and then
makes the query to Solr, which then follows the redirect and fills the rest
of the fields using the final url. But I am not quite sure of this.

It depends on what version of Nutch you are using. This was a problem with some older Trunk versions. The problem is that Nutch has the concept of a representative url for redirects. Redirects have an original and a redirected-to url. Logic dictates which of those is stored as the url and which is displayed on search results pages. Most of the problems with this mismatch have been fixed in recent patches and should ship in a new 1.0 release in the next week or so.
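
One way to picture the mismatch, as a rough sketch only: if the page text is stored under one url of the redirect pair and the summary is looked up under the other, the lookup finds nothing and the summary comes back empty. The class, method names, and example urls below are hypothetical and are not Nutch code; that the empty summary is caused by exactly this kind of lookup is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectSummaryMismatch {
    // Hypothetical stand-in for page text stored in a segment, keyed by url.
    final Map<String, String> parseTextByUrl = new HashMap<>();

    void storeParseText(String url, String text) {
        parseTextByUrl.put(url, text);
    }

    // Returns the text stored under this exact url, or "" if there is none.
    String summarySource(String url) {
        return parseTextByUrl.getOrDefault(url, "");
    }

    public static void main(String[] args) {
        RedirectSummaryMismatch store = new RedirectSummaryMismatch();
        String original = "http://example.com/old";     // url that redirects
        String redirectedTo = "http://example.com/new"; // final url

        // Assumption: the fetched content ends up keyed by the redirected-to url.
        store.storeParseText(redirectedTo, "Text of the final page.");

        // Looking up by the original url finds nothing, so the summary is empty.
        System.out.println("[" + store.summarySource(original) + "]");
        // Looking up by the url the text was stored under works.
        System.out.println("[" + store.summarySource(redirectedTo) + "]");
    }
}
```

Picking a single representative url and using it consistently for both storage and lookup is what removes this kind of gap.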


I would like to know how Nutch generates summaries, and why it
leaves them empty when redirecting. Perhaps there is a command to generate
one field in particular, after the indexing is done.

Summaries are generated at query time from the full text of the web page, which is stored in ParseText under segments. The org.apache.nutch.searcher.Summarizer plugins are what actually return the summary text. By default Nutch uses the summary-basic plugin.
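
To make the query-time idea concrete, here is a minimal sketch of building a summary from stored page text and a query term. It is not the summary-basic algorithm and does not use the Nutch Summarizer interface; the class name, method, and windowing rule are assumptions chosen to keep the example self-contained.

```java
import java.util.Arrays;

public class QueryTimeSummary {
    // Returns a window of words around the first occurrence of term.
    // If the term is not found, the window starts at the first word.
    // If there is no stored text, the summary is empty.
    static String summarize(String text, String term, int context) {
        if (text == null || text.trim().isEmpty()) {
            return "";
        }
        String[] tokens = text.trim().split("\\s+");
        int hit = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(term)) {
                hit = i;
                break;
            }
        }
        int from = Math.max(0, hit - context);
        int to = Math.min(tokens.length, hit + context + 1);
        return String.join(" ", Arrays.copyOfRange(tokens, from, to));
    }

    public static void main(String[] args) {
        String stored = "Nutch builds summaries from stored parse text at query time";
        // Two words of context on each side of the matching term.
        System.out.println(summarize(stored, "summaries", 2));
        // No stored text for the page: nothing to summarize.
        System.out.println("[" + summarize("", "summaries", 2) + "]");
    }
}
```

Because the summary is computed from the stored text when the query runs, an empty or missing text for a page leads directly to an empty summary, whatever the index fields contain.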


Dennis

Thanks!
