[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1838:
---------------------------------
    Description: 
Both the regex and automaton URL filters pass every URL through every rule,
which makes little sense if you have a lot of generated rules for many
different hosts or domains. This patch lets users configure rules that apply
to a specific host or domain only, making filtering much more efficient.

Each rule has an optional hostOrDomain field. A rule is applied if it has no
hostOrDomain, or if its hostOrDomain matches the URL's host name or domain
name.
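
The selection logic described above could be sketched roughly as follows. This is an illustration only and not code from the attached patch: the class `ScopedRuleFilter`, its `Rule` type and the `filter` signature are hypothetical, the URL's host and domain are passed in to avoid URL parsing, and it assumes the first applicable matching rule decides the outcome.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch; none of these names are taken from the NUTCH-1838 patch. */
final class ScopedRuleFilter {

  /** One filter rule: accept/reject sign, optional scope, and a regex. */
  static final class Rule {
    final boolean accept;       // true for '+', false for '-'
    final String hostOrDomain;  // null means the rule is generic
    final Pattern pattern;

    Rule(boolean accept, String hostOrDomain, String regex) {
      this.accept = accept;
      this.hostOrDomain = hostOrDomain;
      this.pattern = Pattern.compile(regex);
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  void add(boolean accept, String hostOrDomain, String regex) {
    rules.add(new Rule(accept, hostOrDomain, regex));
  }

  /**
   * Returns the URL if the first applicable matching rule accepts it,
   * otherwise null (assumption: first match wins, no match means reject).
   */
  String filter(String url, String host, String domain) {
    for (Rule rule : rules) {
      String scope = rule.hostOrDomain;
      // Skip scoped rules whose scope is neither the URL's host nor its domain,
      // so the regex is never evaluated for unrelated hosts.
      if (scope != null && !scope.equals(host) && !scope.equals(domain)) {
        continue;
      }
      if (rule.pattern.matcher(url).find()) {
        return rule.accept ? url : null;
      }
    }
    return null;
  }
}
```

The efficiency gain in this sketch comes from the cheap string comparison that runs before any regex is evaluated: a rule scoped to another host or domain is skipped without touching its pattern.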

The following line enables hostOrDomain-specific rules; the rules after it
apply only to www.example.org:
{code}
> www.example.org
{code}

The following line resets it, so the rules after it are generic again:
{code}
<
{code}

Full example:
{code}
-some generic filter
+another generic filter

> www.example.org
-rule only applied to URLs of www.example.org
+another rule only applied to URLs of www.example.org

> apache.org
-rule only applied to URLs of apache.org
+another rule only applied to URLs of apache.org

<
-more generic rules
+and another one
{code}
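
One way to read such a file is to track the current scope while walking the lines. The sketch below is a hedged illustration, not the patch's parser: `ScopedRuleParser` and `ParsedRule` are hypothetical names, and it assumes that blank lines are ignored and that any line not starting with `>`, `<`, `+` or `-` is an error.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of parsing the rule syntax shown above. */
final class ScopedRuleParser {

  static final class ParsedRule {
    final char sign;            // '+' or '-'
    final String hostOrDomain;  // null for a generic rule
    final String expression;    // rule text after the sign

    ParsedRule(char sign, String hostOrDomain, String expression) {
      this.sign = sign;
      this.hostOrDomain = hostOrDomain;
      this.expression = expression;
    }
  }

  static List<ParsedRule> parse(List<String> lines) {
    List<ParsedRule> rules = new ArrayList<>();
    String scope = null; // current hostOrDomain, null while rules are generic
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        continue; // assumption: blank lines carry no meaning
      }
      char first = line.charAt(0);
      switch (first) {
        case '>': // start of a host- or domain-specific block
          scope = line.substring(1).trim();
          break;
        case '<': // back to generic rules
          scope = null;
          break;
        case '+':
        case '-': // ordinary rule, tagged with the current scope
          rules.add(new ParsedRule(first, scope, line.substring(1)));
          break;
        default:
          throw new IllegalArgumentException("Unrecognized line: " + line);
      }
    }
    return rules;
  }
}
```

Applied to the full example, this sketch yields eight rules: the first two and the last two with no scope, two scoped to `www.example.org`, and two scoped to `apache.org`.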

  was:
Both the regex and automaton URL filters pass every URL through every rule,
which makes little sense if you have a lot of generated rules for many
different hosts. This patch lets users configure rules that apply to a
specific host only, making filtering much more efficient.

Each rule has an optional host field. A rule is applied if it has no host, or
if its host matches the URL's host name.

The following line enables host-specific rules; the rules after it apply only
to www.example.org:
{code}
> www.example.org
{code}

The following line resets it, so the rules after it are generic again:
{code}
<
{code}

Full example:
{code}
-some generic filter
+another generic filter

> www.example.org
-rule only applied to URLs of www.example.org
+another rule only applied to URLs of www.example.org

> www.apache.org
-rule only applied to URLs of www.apache.org
+another rule only applied to URLs of www.apache.org

<
-more generic rules
+and another one
{code}


> Host and domain based regex and automaton filtering
> ---------------------------------------------------
>
>                 Key: NUTCH-1838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1838
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, 
> NUTCH-1838.patch
>



