Duplicate pages in result of queries

vishal vachhani Sun, 21 Sep 2008 09:54:50 -0700

Hi,

Is this a bug, or am I missing something?


I have crawled many URLs using Nutch-0.9. When I query the index created
from the crawl, some of the results are duplicates.

How does Nutch decide that two URLs are duplicates? Is it based on URL string
matching or on the content of the pages?

For example, the content of these pages is the same, but the URLs differ
because of "/", "//" and "///" after the host name:

http://www.indianholiday.com/india-wildlife-holidays/index.html
                            ^
http://www.indianholiday.com//india-wildlife-holidays/index.html
                            ^^
http://www.indianholiday.com///india-wildlife-holidays/index.html
                            ^^^
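
To make the problem concrete, here is a minimal Python sketch (not Nutch code;
collapse_slashes is a hypothetical helper written only for illustration) of the
kind of normalization that would treat the three URLs above as one page. It
assumes that repeated slashes in the path are never significant, which may not
hold for every site.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path component only (hypothetical helper)."""
    parts = urlsplit(url)
    # Only the path is rewritten, so the '//' after the scheme is preserved.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


urls = [
    "http://www.indianholiday.com/india-wildlife-holidays/index.html",
    "http://www.indianholiday.com//india-wildlife-holidays/index.html",
    "http://www.indianholiday.com///india-wildlife-holidays/index.html",
]

# The URLs are distinct as raw strings but identical once normalized.
normalized = {collapse_slashes(u) for u in urls}
```

With plain string comparison the three URLs count as three pages; after the
normalization step they reduce to a single URL.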

Any idea how to remove this kind of duplicate page from the crawl?

Thanks in advance!!

-- 
Thanks and Regards,
Vishal Vachhani
