The dedup will remove both exact duplicate URLs and exact duplicate content (based on a hash of the content). So if there is any difference in the content, even a single letter, the hashes won't match and the pages will be considered different content.
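To see why the content side of dedup is so strict, here is a small standalone sketch (plain Java, not Nutch code, assuming an MD5-style digest as the content hash): two pages that differ by a single character produce completely different hashes, so dedup never sees them as duplicates.

import java.security.MessageDigest;

public class DigestDemo {
  static String md5(String content) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest(content.getBytes("UTF-8"))) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    // One extra character is enough to change the digest completely,
    // so the two pages stay in the index as separate documents.
    System.out.println(md5("<html><body>hello</body></html>"));
    System.out.println(md5("<html><body>hello.</body></html>"));
  }
}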
You can do a couple of different things to fix this.

One is to create a new URLNormalizer and then re-parse the content through the parse command. In the normalizer you would fix the differences, so when the content is re-parsed the variant URLs come out as the same URL, which the dedup process then removes later. To do this, though, you would have to re-parse, re-index and then dedup.

Another option is a URL filter that simply removes URLs containing the #a fragment, as those are internal links. Again you would need to re-parse, etc. Rough sketches of both approaches are at the end of this message, below the quoted text.

Let me know if you need more information on how to do this.

Dennis Kubes

Brian Whitman wrote:
> I'm using Solr to search the Nutch Lucene index (can't use the nutch
> searcher in our current app.) Using the latest Nutch nightly.
>
> There are a lot of duplicate URLs in the Lucene index--
> http://url.com/ vs. http://url.com are two different Lucene documents,
> as are http://url.com/index.html#a and http://url.com/index.html
>
> The Nutch search jsp seems to have some intelligence to remove the
> duplicates -- with the "show all hits" toggle button at the end of the
> results.
>
> Is there a tool to remove duplicates directly from the Lucene index? I
> do call 'nutch dedup' in my crawl script but it doesn't seem to affect
> the results.
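Rough sketch of the first option, the normalizer. In Nutch the logic below would live in a URLNormalizer plugin (or you could get much the same effect with rules for the regex URL normalizer in regex-normalize.xml); the class and method signature here are just for illustration, not the actual plugin API.

import java.net.MalformedURLException;
import java.net.URL;

public class AnchorStrippingNormalizer {
  // Core of what a custom normalizer would do for this case: drop the
  // #a fragment and give a bare host a trailing slash, so that
  // http://url.com and http://url.com/ normalize to the same URL.
  public static String normalize(String urlString) throws MalformedURLException {
    int hash = urlString.indexOf('#');
    if (hash >= 0) {
      urlString = urlString.substring(0, hash);   // strip internal-link anchor
    }
    URL url = new URL(urlString);
    if (url.getPath().length() == 0) {
      urlString = urlString + "/";                // http://url.com -> http://url.com/
    }
    return urlString;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(normalize("http://url.com"));               // http://url.com/
    System.out.println(normalize("http://url.com/index.html#a"));  // http://url.com/index.html
  }
}

Once a normalizer like this is in place you still have to re-parse, re-index and run dedup again so the old variant URLs collapse into one document.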

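And a sketch of the second option, the URL filter. The decision is simply to reject anything with a fragment; in Nutch you would put this behind the URLFilter extension point (where returning null drops the URL) or add a "-" rule matching '#' to regex-urlfilter.txt. The class name below is made up for the example.

public class AnchorUrlFilter {
  // Reject any URL containing a '#' fragment, since it is an internal
  // link to a page that is already crawled under its fragment-free form.
  // In a URLFilter plugin, returning null tells Nutch to drop the URL.
  public static String filter(String urlString) {
    return urlString.indexOf('#') >= 0 ? null : urlString;
  }

  public static void main(String[] args) {
    System.out.println(filter("http://url.com/index.html"));    // kept
    System.out.println(filter("http://url.com/index.html#a"));  // null (dropped)
  }
}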