Markus Jelsma created NUTCH-1838:
------------------------------------

             Summary: Host based regex and automaton filtering
                 Key: NUTCH-1838
                 URL: https://issues.apache.org/jira/browse/NUTCH-1838
             Project: Nutch
          Issue Type: New Feature
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma


Both the regex and automaton filters pass every URL through every rule, which 
makes little sense if you have many generated rules for many different 
hosts. This patch lets users configure rules that apply to a specific host 
only, which makes filtering much more efficient.

Each rule has an optional host field. A rule is applied to a URL if the rule 
has no host, or if the URL's host name matches the rule's host.

The following line enables host-specific rules for www.example.org:
{code}
> www.example.org
{code}

The following line resets the host, so that subsequent rules are generic again:
{code}
<
{code}

Full example:
{code}
-some generic filter
+another generic filter

> www.example.org
-rule only applied to URLs of www.example.org
+another rule only applied to URLs of www.example.org

> www.apache.org
-rule only applied to URLs of www.apache.org
+another rule only applied to URLs of www.apache.org

<
-more generic rules
+and another one
{code}
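
To illustrate how a rule file in this format could be evaluated, here is a minimal Java sketch. It is not the code from the patch: the class and method names (HostScopedFilterSketch, parse, accepts) are hypothetical, and it assumes that the first matching rule decides the outcome, that patterns are matched with find(), and that a URL matching no rule is rejected. Rules with a host are skipped unless the URL's host equals it, which is where the efficiency gain would come from.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of host-scoped rule parsing and filtering; not the NUTCH-1838 patch.
class HostScopedFilterSketch {

    // One parsed rule. A null host means the rule is generic and applies to every URL.
    static final class Rule {
        final boolean accept;
        final String host;
        final Pattern pattern;

        Rule(boolean accept, String host, Pattern pattern) {
            this.accept = accept;
            this.host = host;
            this.pattern = pattern;
        }
    }

    // Parses rule lines: "> host" sets the current host, "<" resets it,
    // and "+regex" / "-regex" add a rule bound to the current host (if any).
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        String currentHost = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char first = line.charAt(0);
            if (first == '>') {
                currentHost = line.substring(1).trim();
            } else if (first == '<') {
                currentHost = null;
            } else if (first == '+' || first == '-') {
                rules.add(new Rule(first == '+', currentHost,
                        Pattern.compile(line.substring(1))));
            } else {
                throw new IllegalArgumentException("Unrecognized rule line: " + line);
            }
        }
        return rules;
    }

    // Assumed semantics: the first applicable rule whose pattern is found in the
    // URL decides; if nothing matches, the URL is rejected.
    // URI.create throws IllegalArgumentException for syntactically invalid URLs.
    static boolean accepts(List<Rule> rules, String url) {
        String host = URI.create(url).getHost();
        for (Rule rule : rules) {
            if (rule.host != null && !rule.host.equalsIgnoreCase(host)) {
                continue; // host-specific rule for a different host: never evaluated
            }
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(List.of(
                "-\\.(gif|jpg)$",
                "> www.example.org",
                "-/private/",
                "<",
                "+."));
        System.out.println(accepts(rules, "http://www.example.org/private/a.html"));
        System.out.println(accepts(rules, "http://www.apache.org/private/a.html"));
    }
}
```

A real implementation would more likely group host-specific rules in a map keyed by host name, so that rules for other hosts are not even iterated over; the linear scan above only keeps the sketch short.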



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
