[ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515650
 ] 

Doğacan Güney commented on NUTCH-439:
-------------------------------------

If there are no objections, I am going to commit this one. 

This is a big change, but it is almost completely self-contained (apart from 
the tld plugin, which is disabled by default), so there should be no harm in 
committing it. Later, we can discuss case by case whether it is useful to 
replace URL.getHost with getDomainName. 

(FWIW, I think scoring-opic and linkdb should use domain name instead of host.)
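
To make the host versus domain name distinction concrete, here is a minimal 
sketch. It is not the code from the attached patches: the class DomainSketch, 
its methods, and the tiny suffix set are all hypothetical and only illustrate 
how a suffix-based lookup can group several hosts under one domain.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

public class DomainSketch {
    // Hypothetical, deliberately tiny suffix list; a real list is far larger.
    private static final Set<String> SUFFIXES =
        Set.of("com", "org", "edu", "de", "tr", "co.uk", "com.tr");

    // Full host name of the URL, lower-cased (assumes the URL has a host part).
    public static String hostOf(String url) {
        return URI.create(url).getHost().toLowerCase(Locale.ROOT);
    }

    // One label plus the longest known suffix; falls back to the host
    // when no suffix in the list matches.
    public static String domainOf(String url) {
        String host = hostOf(url);
        String[] labels = host.split("\\.");
        // Start at index 1 so at least one label remains in front of the suffix;
        // smaller indexes yield longer candidate suffixes, so the longest wins.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return host;
    }

    public static void main(String[] args) {
        String url = "http://www.example.co.uk/page";
        System.out.println(hostOf(url));   // www.example.co.uk
        System.out.println(domainOf(url)); // example.co.uk
    }
}
```

With a lookup like this, www.example.com and news.example.com map to the same 
key (example.com), which is the behaviour that per-domain scoring or link 
grouping would rely on.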

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
> tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch
>
>
> Top-level domains (TLDs) are the last part(s) of a host name in the DNS. 
> TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA 
> divides TLDs into three groups: infrastructure, generic (such as "com", 
> "edu"), and country-code TLDs (such as "de", "tr"). Indexing the top-level 
> domain, and optionally boosting on it, is needed to improve search results 
> and enhance locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
