[ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930
 ] 

Doğacan Güney commented on NUTCH-439:
-------------------------------------

A big +1 from me. It may be useful, though, to break this patch into multiple 
pieces (fixes to opic and the build system as one patch, core changes as 
another, and the plugin as a third).

IMHO, most usages of URL.getHost should be replaced with this patch's 
getDomainName. For example, the "host" field in the index currently gets a big 
boost, but hosts are easy to spam: just buy the domain 'example.com', set up 
your own DNS, and add 'foo.example.com', 'bar.example.com', 'baz.example.com'. 
I have actually seen a lot of spam sites that do this. Using getDomainName in 
linkdb would also reduce anchor spam (where 'foo.example.com' links to 
'bar.example.com', and nutch considers this an external link and stores the 
anchor).
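
To make the host-versus-domain distinction concrete, here is a minimal sketch. 
It is not the patch's code: the class name, the tiny suffix list and the 
isInternalLink helper are all hypothetical, and only the name getDomainName is 
borrowed from the patch. It uses java.net.URI rather than URL to pull the host 
out of a url.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Set;

final class DomainNameSketch {
    // Hypothetical, deliberately tiny suffix list (an assumption for this sketch).
    // A real implementation would need a complete list of domain suffixes.
    private static final Set<String> SUFFIXES =
        Set.of("com", "org", "net", "co.uk", "com.tr");

    /** Returns the registered domain of a host, e.g. foo.example.com -> example.com. */
    static String getDomainName(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        // Scan from the left so the first suffix that matches is the longest one.
        for (int i = 0; i < labels.length; i++) {
            String candidate =
                String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                // Keep exactly one label in front of the suffix, if there is one.
                return i == 0
                    ? h
                    : String.join(".", Arrays.copyOfRange(labels, i - 1, labels.length));
            }
        }
        return h; // Unknown suffix: fall back to the full host name.
    }

    /** Treats a link as internal when both urls share the same domain. */
    static boolean isInternalLink(String fromUrl, String toUrl) {
        String from = getDomainName(URI.create(fromUrl).getHost());
        String to = getDomainName(URI.create(toUrl).getHost());
        return from.equals(to);
    }

    public static void main(String[] args) {
        System.out.println(getDomainName("foo.example.com"));   // example.com
        System.out.println(getDomainName("www.example.co.uk")); // example.co.uk
        System.out.println(
            isInternalLink("http://foo.example.com/a", "http://bar.example.com/b")); // true
    }
}
```

With a check like this, a link from 'foo.example.com' to 'bar.example.com' 
would count as internal, so its anchor would not be stored as an external one.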

Another example is the generator. Instead of partitioning on host or IP, we 
can partition urls by domain. This avoids the overhead of resolving IPs, and 
IP resolving has its own problems: urls under the same domain (sometimes even 
the same url) may be served from different IPs (think load balancers). 
Partitioning by domain will also be much more polite and more resistant to 
honey pots.
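
A rough sketch of what partitioning by domain could look like follows. It is 
not nutch's generator code: the class and method names are hypothetical, and 
naiveDomain simply keeps the last two labels of the host, which is an 
assumption that is wrong for suffixes like 'co.uk'. A real version would call 
the patch's getDomainName instead.

```java
import java.net.URI;

final class DomainPartitionSketch {
    // Naive stand-in for a real domain lookup (assumption): keep the last two labels.
    static String naiveDomain(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        int n = labels.length;
        return n <= 2 ? h : labels[n - 2] + "." + labels[n - 1];
    }

    /** Maps a url to a partition in [0, numPartitions) using only its domain. */
    static int getPartition(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Fall back to the whole url when no host can be parsed out of it.
        String key = (host == null) ? url : naiveDomain(host);
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int a = getPartition("http://foo.example.com/page", 8);
        int b = getPartition("http://bar.example.com/other", 8);
        System.out.println(a == b); // true: both hosts belong to example.com
    }
}
```

Since the key is computed from the url string alone, no DNS lookup is needed, 
and every subdomain of a domain ends up in the same partition.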

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
> tld_plugin_v2.0.patch, tld_plugin_v2.1.patch
>
>
> Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
> system. TLDs are managed by the Internet Assigned Numbers Authority (IANA). 
> IANA divides tlds into three groups: infrastructure, generic (such as "com", 
> "edu") and country code tlds (such as "en", "de", "tr"). Indexing the top 
> level domain, and optionally boosting it, is needed to improve the search 
> results and enhance locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
