Markus Jelsma created NUTCH-1838:
------------------------------------
Summary: Host based regex and automaton filtering
Key: NUTCH-1838
URL: https://issues.apache.org/jira/browse/NUTCH-1838
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Both the regex and automaton filters pass every URL through every rule, which
makes little sense if you have a lot of generated rules for many different
hosts. This patch allows users to configure rules that apply to a specific
host only, so URLs of other hosts skip them, making filtering much more efficient.
Each rule has an optional host field. A rule is applied if it has no host, or
if the URL's host name matches the rule's host.
The following line enables host-specific rules; every rule after it applies only to www.example.org:
{code}
> www.example.org
{code}
The following line resets the host again, so the rules after it are generic:
{code}
<
{code}
Full example:
{code}
-some generic filter
+another generic filter
> www.example.org
-rule only applied to URL's of www.example.org
+another rule only applied to URL's of www.example.org
> www.apache.org
-rule only applied to URL's of www.apache.org
+another rule only applied to URL's of www.apache.org
<
-more generic rules
+and another one
{code}
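To illustrate the semantics described above, here is a minimal Java sketch of how such a rule file could be parsed and applied. It is not the code from the patch: the class HostRuleSketch and its methods parse, filter and hostOf are hypothetical names, it uses java.util.regex only, and it assumes first-match-wins ordering, rejection of URLs that match no rule, and '#' comment lines.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the NUTCH-1838 patch itself.
class HostRuleSketch {

    /** One rule: accept (+) or reject (-), a regex, and an optional host. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;
        final String host; // null means the rule is generic

        Rule(boolean accept, String regex, String host) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
            this.host = host;
        }
    }

    /** Parses rule lines, tracking the host set by '>' and reset by '<'. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        String currentHost = null;
        for (String raw : lines) {
            String line = raw.trim();
            // Assumption: blank lines and '#' comments are ignored.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String rest = line.substring(1);
            switch (line.charAt(0)) {
                case '>': // following rules apply to this host only
                    String h = rest.trim().toLowerCase();
                    currentHost = h.isEmpty() ? null : h;
                    break;
                case '<': // back to generic rules
                    currentHost = null;
                    break;
                case '+':
                    rules.add(new Rule(true, rest, currentHost));
                    break;
                case '-':
                    rules.add(new Rule(false, rest, currentHost));
                    break;
                default:
                    throw new IllegalArgumentException("Unrecognized line: " + raw);
            }
        }
        return rules;
    }

    /** Returns the URL if accepted, or null if rejected. */
    static String filter(List<Rule> rules, String url) {
        String host = hostOf(url);
        for (Rule rule : rules) {
            // Skip host-specific rules for other hosts without running the regex.
            if (rule.host != null && !rule.host.equals(host)) {
                continue;
            }
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null; // assumption: first match wins
            }
        }
        return null; // assumption: no matching rule means the URL is rejected
    }

    /** Extracts the lower-cased host name, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

With the rules "-\.gif$", "> www.example.org", "-/private/", "<", "+." the "/private/" rule is only evaluated for URLs of www.example.org, while the other two rules are evaluated for every URL. The saving in this sketch comes from replacing a regex evaluation with a string comparison for rules that belong to another host.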
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)