[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656042#comment-13656042
 ] 

Markus Jelsma commented on NUTCH-1325:
--------------------------------------

Hi Tejas - you're right about (1). It should indeed be host_a.example.org, 
host_b.example.org ==> example.org, but not x.xyz.org, a.abc.org ==> unknown. 
The reducer should take the domain + suffix as key and emit the domain only if 
*ALL* of its hosts are unknown. If you emit a domain when most but not all of 
its hosts are unknown, the DomainBlacklistURLFilter will remove the entire 
domain from the CrawlDB and WebgraphDB.
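
To make the "all hosts unknown" rule concrete, here is a minimal sketch in plain 
Java, without the Hadoop reducer plumbing. The class, the method names and the 
naive two-label domainOf() helper are assumptions for illustration only; they 
are not taken from the attached patches, and a real implementation would 
resolve the domain + suffix with a proper suffix list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UnknownDomainSketch {

    /**
     * Hypothetical helper: treats the last two labels of a host as its domain.
     * This is a simplification and is wrong for suffixes such as co.uk.
     */
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /**
     * Groups hosts by domain and returns only the domains for which
     * every host is flagged as unknown.
     */
    static List<String> domainsToBlacklist(Map<String, Boolean> hostIsUnknown) {
        // domain -> true as long as all hosts seen so far are unknown
        Map<String, Boolean> allUnknown = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> e : hostIsUnknown.entrySet()) {
            allUnknown.merge(domainOf(e.getKey()), e.getValue(), Boolean::logicalAnd);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : allUnknown.entrySet()) {
            if (e.getValue()) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With host_a.example.org and host_b.example.org both unknown, example.org is 
emitted. A domain with even one known host is not emitted, so the blacklist 
filter leaves it alone.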

The example for (2) does not include cross-domain redirects but the problem is 
similar. I think it works fine for now because multi-redirects are not very 
common across the internet.

A larger problem is the filterNormalize() method. It actually receives a 
hostname, not a URL, so to pass the URL filters we must prepend a URL scheme to 
make it look like a URL. I use the http:// scheme, but not all hosts may allow 
that scheme. We have a modified domain filter that optionally takes a scheme so 
we can force HTTPS for specific domains. Those domains are filtered out because 
HTTP is not allowed for them.
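
The scheme problem can be illustrated with a small stand-alone sketch. All of 
the names below are hypothetical, and the filter is keyed by host rather than 
by domain to keep it short. It only mimics the idea of a scheme-aware filter 
and is neither our modified domain filter nor the code in the patch. It follows 
the usual URL filter convention of returning null for a rejected URL.

```java
import java.net.URI;
import java.util.Map;

final class SchemePrefixSketch {

    /** Turns a bare hostname into something that looks like a URL. */
    static String toUrl(String host, String scheme) {
        return scheme + "://" + host + "/";
    }

    /**
     * Hypothetical scheme-aware filter: returns the URL if it is allowed,
     * or null if the host requires a different scheme.
     */
    static String filter(String url, Map<String, String> requiredScheme) {
        URI uri = URI.create(url);
        String required = requiredScheme.get(uri.getHost());
        if (required != null && !required.equalsIgnoreCase(uri.getScheme())) {
            return null; // rejected: wrong scheme for this host
        }
        return url;
    }
}
```

If secure.example.org is configured to require https, then 
toUrl("secure.example.org", "http") is rejected by this filter even though the 
host itself is fine, which is the effect described above.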

I think I've got a slightly newer version of the tools but don't know what 
actually changed in the past year. I'll try to diff and upload it.
                
> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.7
>
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
