Duplicate URLs with slightly different URIs.. how to normalize?

Brian Whitman Tue, 02 Jan 2007 14:08:31 -0800

I'm using Solr to search the Nutch Lucene index (can't use the nutchsearcher in our current app.) Using the latest Nutch nightly.

There are a lot of duplicate URLs in the Lucene index -- http://url.com/ vs. http://url.com are two different Lucene documents, as are http://url.com/index.html#a and http://url.com/index.html

The Nutch search jsp seems to have some intelligence to remove the duplicates -- with the "show all hits" toggle button at the end of the results.

Is there a tool to remove duplicates directly from the Lucene index? I do call 'nutch dedup' in my crawl script but it doesn't seem to affect the results.
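
For illustration, here is a minimal sketch of the kind of normalization being asked about: collapsing a missing trailing slash and a fragment so that the example URLs above map to the same key. This is not Nutch's own URL normalizer or the behavior of 'nutch dedup'; the function name normalize_url is hypothetical, and the rules are limited to the two cases described in this message plus lowercasing the scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Hypothetical helper: map trivially different URLs to one canonical form."""
    parts = urlsplit(url)
    # Scheme and host are case-insensitive, so lowercase them.
    # Assumption: the URL carries no user:password@ prefix, which
    # would also be lowercased here.
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    for u in ("http://url.com", "http://url.com/",
              "http://url.com/index.html#a", "http://url.com/index.html"):
        print(u, "->", normalize_url(u))
```

Applied before indexing, or as a grouping key over an existing index, a function like this would make the two pairs above compare equal. It deliberately leaves query strings and paths such as /index.html untouched, since treating those as equivalent depends on the site.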
