[ https://issues.apache.org/jira/browse/NUTCH-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398836#comment-13398836 ]
Markus Jelsma commented on NUTCH-1407:
--------------------------------------
We usually filter subscribers by host or by a small group of hosts. This is,
however, not feasible for subscribers with millions of subdomains. In Solr it
can be achieved with copyFields and some regular expressions, or with a custom
update processor, but that is cumbersome. Doing it in Nutch with URLUtil also
has the advantage that it understands domains with more than one
extension/suffix (for example, co.uk).
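
To illustrate why multi-part suffixes matter, here is a minimal, self-contained
sketch of suffix-aware domain extraction. It is not Nutch's URLUtil code: the
class name DomainSketch, the method domainOf and the five-entry suffix set are
all hypothetical and exist only for this example. A real implementation would
consult a complete suffix list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class DomainSketch {
    // Hypothetical, deliberately tiny suffix list; an assumption for illustration only.
    private static final Set<String> SUFFIXES =
        new HashSet<>(Arrays.asList("com", "org", "nl", "uk", "co.uk"));

    /** Returns the longest known suffix plus the one label to its left. */
    public static String domainOf(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        // Scan from the left so that longer candidate suffixes are tried first.
        for (int i = 1; i < labels.length; i++) {
            String candidate =
                String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                return labels[i - 1] + "." + candidate;
            }
        }
        // No known suffix matched: fall back to the host itself.
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("a.b.example.com"));
        System.out.println(domainOf("shop.example.co.uk"));
    }
}
```

Simply keeping the last two labels would turn shop.example.co.uk into co.uk,
which is the case a regular expression in a Solr copyField setup has to handle
by hand.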
> BasicIndexingFilter to optionally add domain field
> --------------------------------------------------
>
> Key: NUTCH-1407
> URL: https://issues.apache.org/jira/browse/NUTCH-1407
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1407-1.6-1.patch
>
>
> The basic indexing filter already adds the host field to a NutchDocument but
> not a domain field. In Solr you can copyField a host field and obtain a domain
> field, but this is a bit cumbersome and not very user-friendly.