[
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656042#comment-13656042
]
Markus Jelsma commented on NUTCH-1325:
--------------------------------------
Hi Tejas - you're right about (1): it should indeed be host_a.example.org,
host_b.example.org ==> example.org, but not x.xyz.org, a.abc.org ==> unknown.
The reducer should take the domain + suffix as key and emit the domain only if
*ALL* of its hosts are unknown. If you emit a domain when most but not all of
its hosts are unknown, the DomainBlacklistURLFilter will remove the entire
domain from the CrawlDB and WebgraphDB.
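To make the "emit only if ALL hosts are unknown" rule concrete, here is a minimal plain-Java sketch, not the actual HostDB reducer. The class and method names are hypothetical, and domainOf() is a naive stand-in that keeps the last two labels; a real job would need a public-suffix-aware lookup to get the domain + suffix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups hosts by domain and keeps only the domains
// for which every host is flagged as unknown.
class UnknownDomainSketch {

    // Assumption: naive "domain + suffix" extraction that keeps the last two
    // labels. This is wrong for suffixes such as co.uk.
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // hostUnknown maps a hostname to true when that host is unknown.
    // Returns the domains whose hosts are all unknown.
    static List<String> domainsToEmit(Map<String, Boolean> hostUnknown) {
        Map<String, Boolean> allUnknown = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> e : hostUnknown.entrySet()) {
            // AND the flags together: one known host keeps the domain alive.
            allUnknown.merge(domainOf(e.getKey()), e.getValue(), Boolean::logicalAnd);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : allUnknown.entrySet()) {
            if (e.getValue()) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Boolean> hosts = new LinkedHashMap<>();
        hosts.put("host_a.example.org", true);
        hosts.put("host_b.example.org", true);
        hosts.put("x.xyz.org", true);
        hosts.put("y.xyz.org", false); // one known host, so xyz.org is kept
        System.out.println(domainsToEmit(hosts));
    }
}
```

In a MapReduce job the same AND would run in the reducer over all values that arrive under one domain key.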
The example for (2) does not include cross-domain redirects, but the problem is
similar. I think it works fine for now because chains of multiple redirects are
not very common on the internet as a whole.
A larger problem is the filterNormalize() method. It actually receives a
hostname, not a URL, so to pass the URL filters we must prepend a URL scheme to
make it look like a URL. I use the http:// scheme, but not all hosts may allow
that scheme. We have a modified domain filter that optionally takes a scheme so
we can force HTTPS for specific domains. Those domains are filtered out because
HTTP is not allowed for them.
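One possible way around this, sketched below under stated assumptions, is to pick the scheme per domain before handing the hostname to the filters. HostUrlSketch, toUrl() and the schemeByDomain map are hypothetical names for illustration; they are not part of the patch or of the modified domain filter.

```java
import java.util.Map;

// Hypothetical sketch: turns a bare hostname into a URL-shaped string,
// using a per-domain scheme override and defaulting to http.
class HostUrlSketch {

    // schemeByDomain is an assumed configuration, e.g. {"secure.example": "https"}.
    static String toUrl(String host, Map<String, String> schemeByDomain) {
        String scheme = "http";
        for (Map.Entry<String, String> e : schemeByDomain.entrySet()) {
            String domain = e.getKey();
            // Match the domain itself and any host below it.
            if (host.equals(domain) || host.endsWith("." + domain)) {
                scheme = e.getValue();
                break;
            }
        }
        return scheme + "://" + host + "/";
    }

    public static void main(String[] args) {
        Map<String, String> forced = Map.of("secure.example", "https");
        System.out.println(toUrl("www.secure.example", forced));
        System.out.println(toUrl("host_a.example.org", forced));
    }
}
```

With something like this, a host under an HTTPS-only domain would reach the URL filters with the scheme those filters expect, instead of being rejected for using HTTP.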
I think I've got a slightly newer version of the tools, but I don't know what
actually changed in the past year. I'll try to diff and upload it.
> HostDB for Nutch
> ----------------
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.7
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database
> containing information on hosts.