[
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848517#comment-13848517
]
Tejas Patil commented on NUTCH-1325:
------------------------------------
Hi [~markus17],
I stopped by this Jira (after a long time!) intending to get it to a stage
where it could go into trunk.
You had replied to my two concerns.
For (1):
{noformat}host_a.example.org, host_b.example.org ==> example.org{noformat}
This might *NOT* be a good idea.
(a) The websites for, say, "cs.uci.edu" and "bio.uci.edu" might be hosted
independently, so it can be argued that they should be treated as different hosts.
(b) I am not sure about the standards, but if something like "uci.cs.edu" is
valid (the subdomain is a suffix of the domain), then resolving both
"uci.cs.edu" and "ucla.cs.edu" to "cs.edu" would be a problem, because two
unrelated sites would end up under a single entry.
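To make both cases concrete, here is a minimal sketch. It assumes the resolution simply keeps the last two labels of the host name, which may not be what the patch actually does. The class and method names (NaiveDomainResolver, resolve, group) are hypothetical and are not taken from Nutch or from the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or the NUTCH-1325 patch.
class NaiveDomainResolver {

    // Assumption: "resolving" a host means keeping its last two labels.
    static String resolve(String host) {
        String normalized = host.toLowerCase(Locale.ROOT);
        String[] labels = normalized.split("\\.");
        if (labels.length <= 2) {
            return normalized;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Groups hosts by their resolved domain, preserving insertion order.
    static Map<String, List<String>> group(List<String> hosts) {
        Map<String, List<String>> byDomain = new LinkedHashMap<>();
        for (String host : hosts) {
            byDomain.computeIfAbsent(resolve(host), k -> new ArrayList<>()).add(host);
        }
        return byDomain;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "cs.uci.edu", "bio.uci.edu",   // case (a): independently hosted sites
                "uci.cs.edu", "ucla.cs.edu");  // case (b): unrelated sites
        // Prints {uci.edu=[cs.uci.edu, bio.uci.edu], cs.edu=[uci.cs.edu, ucla.cs.edu]}
        System.out.println(group(hosts));
    }
}
```

With this naive rule, each pair of hosts collapses into one entry, so any per-host information kept in the HostDB would be mixed across sites that may have nothing to do with each other.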
For (2): "I use the HTTP:// scheme but not all hosts may allow that scheme. We
have a modified domain filter that optionally takes a scheme so we can force
HTTPS for specific domains. Those domains are filtered out because HTTP is not
allowed."
Do you have any suggestions for working around this?
> HostDB for Nutch
> ----------------
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database
> containing information on hosts.