[
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930
]
Doğacan Güney commented on NUTCH-439:
-------------------------------------
A big +1 from me. That said, it may be useful to break this patch into
multiple pieces (fixes to OPIC and the build system as one patch, core changes
as a separate patch, and the plugin as a separate patch).
IMHO, most usages of URL.getHost should be replaced with this patch's
getDomainName. For example, the "host" field in the index currently gets a big
boost, but hosts are easy to spam: just buy the domain 'example.com', set up
your own DNS, and add 'foo.example.com', 'bar.example.com', 'baz.example.com'.
I have actually seen a lot of spam sites that do this. Using the domain name in
linkdb as well would reduce anchor spam (where 'foo.example.com' links to
'bar.example.com', and Nutch considers this an external link and stores the
anchor).
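
To make the host vs. domain difference concrete, here is a rough sketch. The
naiveDomain helper is a hypothetical stand-in, not the patch's getDomainName:
it simply keeps the last two labels of the host, so it is wrong for suffixes
like "co.uk" and for IP addresses. The class and method names are made up for
illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class DomainSketch {

  // Lower-cased host of a URL, as returned by URL.getHost.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase(Locale.ROOT);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("bad url: " + url, e);
    }
  }

  // ASSUMPTION: naive stand-in for getDomainName. Keeps only the last two
  // labels of the host; a real implementation needs a TLD/suffix list.
  static String naiveDomain(String url) {
    String host = hostOf(url);
    String[] labels = host.split("\\.");
    if (labels.length <= 2) {
      return host;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  // Host-based check: subdomains of one domain count as different sites.
  static boolean sameHost(String from, String to) {
    return hostOf(from).equals(hostOf(to));
  }

  // Domain-based check: subdomains of one domain count as the same site.
  static boolean sameDomain(String from, String to) {
    return naiveDomain(from).equals(naiveDomain(to));
  }

  public static void main(String[] args) {
    String from = "http://foo.example.com/page";
    String to = "http://bar.example.com/other";
    System.out.println("same host:   " + sameHost(from, to));
    System.out.println("same domain: " + sameDomain(from, to));
  }
}
```

With a host-based check the link from 'foo.example.com' to 'bar.example.com'
looks external; with a domain-based check it is internal, so its anchor text
could be treated like any other internal anchor.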
Another example is the generator. Instead of partitioning on host or IP, we
can partition URLs by domain. This avoids the overhead of resolving IPs (IP
resolution also has its own problems: URLs under the same domain, sometimes
even the same URL, may be served from different IPs, e.g. behind load
balancers) and will be much more polite and more resistant to honeypots.
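
A minimal sketch of domain-based partitioning, in plain Java rather than
against the actual generator or Hadoop partitioner code. The names
DomainPartitionSketch, naiveDomain and partitionFor are hypothetical, and
naiveDomain is again a crude last-two-labels stand-in for the patch's
getDomainName.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class DomainPartitionSketch {

  // ASSUMPTION: naive stand-in for getDomainName (last two host labels);
  // it does not handle suffixes like "co.uk" or IP addresses.
  static String naiveDomain(String url) {
    String host;
    try {
      host = new URL(url).getHost().toLowerCase(Locale.ROOT);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("bad url: " + url, e);
    }
    String[] labels = host.split("\\.");
    if (labels.length <= 2) {
      return host;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  // Maps a URL to a partition using only its domain, so no DNS lookup is
  // needed and all subdomains of one domain land in the same partition.
  static int partitionFor(String url, int numPartitions) {
    int hash = naiveDomain(url).hashCode();
    // Mask the sign bit so the result is never negative.
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    int n = 8;
    System.out.println(partitionFor("http://foo.example.com/a", n));
    System.out.println(partitionFor("http://bar.example.com/b", n));
  }
}
```

Both URLs in main map to the same partition because they share a domain, which
is what lets a single fetch task enforce politeness delays across all of a
domain's subdomains.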
> Top Level Domains Indexing / Scoring
> ------------------------------------
>
> Key: NUTCH-439
> URL: https://issues.apache.org/jira/browse/NUTCH-439
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch,
> tld_plugin_v2.0.patch, tld_plugin_v2.1.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in a DNS
> system. TLDs are managed by the Internet Assigned Numbers Authority (IANA).
> IANA divides TLDs into three groups: infrastructure, generic (such as "com",
> "edu") and country code TLDs (such as "de", "tr"). Indexing the top level
> domain and optionally boosting it is needed for improving the search results
> and enhancing locality.