[ 
https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lihachev updated NUTCH-737:
----------------------------------

    Attachment: NUTCH-737_urlfilter_unalias.patch

> urlnormalizer-unalias plugin
> ----------------------------
>
>                 Key: NUTCH-737
>                 URL: https://issues.apache.org/jira/browse/NUTCH-737
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Dmitry Lihachev
>            Priority: Minor
>         Attachments: NUTCH-737_urlfilter_unalias.patch
>
>
> I searched for whole-site duplication detection tools without success. 
> This plugin performs domain name transformations (for example, 
> www.google.com -> google.com). It is very simple, but can be useful when 
> dealing with site aliases. To detect site aliases I use my own ugly class 
> (based on SolrDeleteDuplicates).
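
The host transformation described above can be illustrated with a minimal
standalone sketch. This is not the code from the attached patch and does not
use the Nutch plugin API; the class name UnaliasSketch, the method unalias,
and the single-entry alias table are hypothetical, chosen only to show the
idea of mapping an alias host to its canonical host while leaving the rest
of the URL untouched.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the plugin from the attached patch.
class UnaliasSketch {
    // Assumed alias table: alias host -> canonical host.
    static final Map<String, String> ALIASES =
            Collections.singletonMap("www.google.com", "google.com");

    // Returns the URL with its host replaced by the canonical host,
    // or the input unchanged if the host is not a known alias.
    static String unalias(String url) {
        URI u;
        try {
            u = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave unparsable URLs alone
        }
        String host = u.getHost();
        if (host == null) {
            return url; // opaque or relative URI: nothing to rewrite
        }
        String canonical = ALIASES.get(host.toLowerCase(Locale.ROOT));
        if (canonical == null) {
            return url;
        }
        // Rebuild from the raw components so existing percent-encoding is preserved.
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://");
        if (u.getRawUserInfo() != null) {
            sb.append(u.getRawUserInfo()).append('@');
        }
        sb.append(canonical);
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getRawPath());
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }
}
```

In a real crawl the alias table would presumably be filled from whatever
alias detection step is in use, rather than hard-coded as it is here.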

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
