[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930 ]
Doğacan Güney commented on NUTCH-439:
-------------------------------------

A big +1 from me. Though it may be useful to break this patch into multiple pieces (fixes to OPIC and the build system as one patch, core changes as a separate patch, and the plugin as a separate patch).

IMHO, most usages of URL.getHost should be replaced with this patch's getDomainName. For example, the "host" field in the index currently gets a big boost, but it is easy to spam hosts: just buy the domain 'example.com', set up your own DNS, and add 'foo.example.com', 'bar.example.com', 'baz.example.com'. I have actually seen a lot of spam sites that do this. Using the domain name in linkdb also reduces anchor spam (where 'foo.example.com' links to 'bar.example.com', and Nutch considers this an external link and stores the anchor).

Another example is the generator. Instead of partitioning on host or IP, we can partition URLs based on their domains. This avoids the overhead of resolving IPs, and IP resolution has problems of its own: URLs under the same domain (sometimes even the same URL) may be served from different IPs (think load balancers and the like). Partitioning by domain will also be much more polite and more resistant to honey pots.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three groups: infrastructure, generic (such as "com", "edu"), and country code TLDs (such as "de", "tr"). Indexing the top level domain, and optionally boosting it, is needed for improving the search results and enhancing locality.
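
To illustrate the domain-based grouping suggested in the comment, here is a minimal sketch. It is not the patch's getDomainName; the class name DomainSketch, the methods domainOf and groupByDomain, and the tiny hard-coded suffix list are all assumptions made for this example. A real implementation would need a complete, maintained list of TLDs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; not part of Nutch or of the NUTCH-439 patch.
class DomainSketch {
    // Assumed, deliberately incomplete suffix list for illustration only.
    private static final Set<String> SUFFIXES =
        Set.of("com", "org", "net", "edu", "de", "tr", "co.uk", "com.tr");

    /** Returns the longest known suffix plus one label, or the host itself if no suffix matches. */
    static String domainOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null; // URL has no parsable host
        }
        host = host.toLowerCase();
        String[] labels = host.split("\\.");
        // Scan from the left so that the longest matching suffix wins.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return host; // unknown suffix (or an IP address): fall back to the host
    }

    /** Groups URLs by domain instead of by host, as a generator partitioner might. */
    static Map<String, List<String>> groupByDomain(List<String> urls) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String url : urls) {
            String domain = domainOf(url);
            if (domain == null) {
                continue; // skip URLs without a host
            }
            groups.computeIfAbsent(domain, k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://foo.example.com/a",
            "http://bar.example.com/b",
            "http://baz.example.com/c",
            "http://www.example.co.uk/d");
        groupByDomain(urls).forEach((domain, members) ->
            System.out.println(domain + " -> " + members));
    }
}
```

With this grouping, the three spam hosts from the comment fall under the single domain 'example.com', so a link from 'foo.example.com' to 'bar.example.com' would count as internal, and all three hosts would land in the same partition.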