[
https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1838:
---------------------------------
Description:
Both the regex and automaton filters pass every URL through every rule, which
makes little sense if you have many generated rules for many different
hosts or domains. This patch lets users configure rules for a specific host or
domain only, making filtering much more efficient.
Each rule has an optional hostOrDomain field. Rules without a hostOrDomain are
applied to every URL; rules with one are applied only to URLs whose host name
or domain name matches it.
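The matching described above can be sketched roughly as follows. This is an illustrative Python sketch, not the patch's Java code: rule_applies and naive_domain are hypothetical names, and the domain is approximated as the last two labels of the host name, which is an assumption and not a real public-suffix lookup.

```python
from typing import Optional
from urllib.parse import urlparse


def naive_domain(host: str) -> str:
    """Approximate the domain as the last two labels of the host name.

    Assumption for illustration only; this is wrong for suffixes such as co.uk.
    """
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def rule_applies(host_or_domain: Optional[str], url: str) -> bool:
    """Return True if a rule with this hostOrDomain should be evaluated for url."""
    if host_or_domain is None:
        # Generic rule: evaluated for every URL.
        return True
    host = (urlparse(url).hostname or "").lower()
    scope = host_or_domain.lower()
    # Scoped rule: evaluated only when the host name or the domain name matches.
    return scope == host or scope == naive_domain(host)
```

Under these assumptions, a rule scoped to apache.org would be evaluated for a URL on nutch.apache.org, while a rule scoped to www.example.org would be skipped for it.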
The following line enables hostOrDomain-specific rules:
{code}
> www.example.org
{code}
The following line disables/resets it again:
{code}
<
{code}
Full example:
{code}
-some generic filter
+another generic filter
> www.example.org
-rule only applied to URL's of www.example.org
+another rule only applied to URL's of www.example.org
> apache.org
-rule only applied to URL's of apache.org
+another rule only applied to URL's of apache.org
<
-more generic rules
+and another one
{code}
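As a rough illustration of how a rules file like the one above could be read, the following Python sketch tracks the current scope while parsing. It is not the patch's implementation: Rule and parse_rules are hypothetical names, and skipping blank lines and lines starting with # is an assumption.

```python
import re
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    accept: bool                   # True for '+', False for '-'
    pattern: re.Pattern            # compiled regular expression of the rule
    host_or_domain: Optional[str]  # None means the rule is generic


def parse_rules(text: str) -> List[Rule]:
    """Parse rule lines, attaching the active '>' scope to each rule."""
    rules: List[Rule] = []
    scope: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed: blank lines and comments are ignored
        marker, rest = line[0], line[1:]
        if marker == ">":
            scope = rest.strip() or None  # following rules apply to this host or domain
        elif marker == "<":
            scope = None                  # reset: following rules are generic again
        elif marker in "+-":
            rules.append(Rule(marker == "+", re.compile(rest), scope))
        else:
            raise ValueError(f"unrecognized rule line: {raw!r}")
    return rules
```

A second > line simply replaces the active scope, so a < is only needed to return to generic rules.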
was:
Both the regex and automaton filters pass every URL through every rule, which
makes little sense if you have many generated rules for many different
hosts. This patch lets users configure rules for a specific host only, making
filtering much more efficient.
Each rule has an optional host field. Rules without a host are applied to every
URL; rules with one are applied only to URLs whose host name matches it.
The following line enables host-specific rules:
{code}
> www.example.org
{code}
The following line disables/resets it again:
{code}
<
{code}
Full example:
{code}
-some generic filter
+another generic filter
> www.example.org
-rule only applied to URL's of www.example.org
+another rule only applied to URL's of www.example.org
> www.apache.org
-rule only applied to URL's of www.apache.org
+another rule only applied to URL's of www.apache.org
<
-more generic rules
+and another one
{code}
> Host and domain based regex and automaton filtering
> ---------------------------------------------------
>
> Key: NUTCH-1838
> URL: https://issues.apache.org/jira/browse/NUTCH-1838
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch,
> NUTCH-1838.patch