Hi,

Is this a bug, or am I missing something?

I have crawled many URLs using Nutch-0.9. When I query the index created from the crawl, some of the results are duplicates. How does Nutch decide that URLs are duplicates? Is it based on URL string matching or on the content of the pages?

For example, the content of the following pages is the same, but the URLs differ because of "/", "//" and "///" after the host name:

http://www.indianholiday.com/india-wildlife-holidays/index.html
http://www.indianholiday.com//india-wildlife-holidays/index.html
http://www.indianholiday.com///india-wildlife-holidays/index.html

Any idea how to remove this kind of duplicate page from the crawl?

Thanks in advance!!

--
Thanks and Regards,
Vishal Vachhani
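
To make the slash problem concrete, here is a minimal sketch of the kind of URL normalization that would map the three variants above to one string. It is plain Python, not Nutch code or Nutch's API; the function name `collapse_slashes` is hypothetical, and it assumes that repeated slashes in the path are never significant for the site in question.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path of a URL into a single '/'.

    Hypothetical helper for illustration only; the scheme separator
    ("http://"), host, query and fragment are left untouched.
    """
    parts = urlsplit(url)
    # Only the path component is rewritten, so "http://" stays intact.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


urls = [
    "http://www.indianholiday.com/india-wildlife-holidays/index.html",
    "http://www.indianholiday.com//india-wildlife-holidays/index.html",
    "http://www.indianholiday.com///india-wildlife-holidays/index.html",
]

# After normalization the three variants become one and the same string.
unique = {collapse_slashes(u) for u in urls}
print(unique)
```

If the same rule were applied to URLs before they are stored, plain string matching on the URL would be enough to treat these pages as one.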
