[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833919#comment-16833919
 ] 

ASF GitHub Bot commented on NUTCH-2690:
---------------------------------------

sebastian-nagel commented on pull request #433: NUTCH-2690 Configurable and 
fast URL filter
URL: https://github.com/apache/nutch/pull/433
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Configurable and fast URL filter
> --------------------------------
>
>                 Key: NUTCH-2690
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2690
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> This improvement introduces a new URL filter plugin "urlfilter-fast" (naming 
> debatable) which has been in use at Common Crawl [since 
> 2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd]
>  to apply a long list of filters. It works in two steps:
> # an exact (suffix) match against the host name retrieves the 
> host/domain-specific regex rules
> # the selected regular expressions are applied to the path (and query) 
> component of the URL
> What makes it faster than urlfilter-regex for common cases:
> - regexes are selected by host name or its domain suffix, so there are 
> usually fewer rules to be checked. That's similar to NUTCH-1838 but any 
> domain suffix can be matched including {{subdomain.domain.com}}, {{com}} or 
> {{.}} for global rules. The selection by host name suffix is itself fast.
> - regexes are applied only to the path component (optionally including the 
> query) and not the entire URL.
>   Matching against a shorter string can make a huge difference for more 
> complex regular expressions.
> - the rule to deny everything from a host or domain is special-cased so 
> that it is fast
> More details about the rule format are found in the plugin's 
> [README|https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md].
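
The two steps described above (select rules by host name suffix, then match only the path and query) can be sketched roughly as follows. This is a minimal illustration and not the plugin's code: the class name `HostSuffixFilterSketch`, the methods `denyHost`, `denyPath` and `accept`, and the in-memory rule representation are all hypothetical, and the real rule file format is the one documented in the plugin's README. The sketch also assumes deny-only rules and rejects URLs without a host, which may differ from the plugin's behaviour.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and rule model are assumptions. */
class HostSuffixFilterSketch {

  // Host/domain suffix -> deny regexes applied to path (+ query) only.
  private final Map<String, List<Pattern>> denyRules = new HashMap<>();
  // Suffixes for which everything is denied; checked without any regex.
  private final Set<String> denyAll = new HashSet<>();

  /** Deny every URL whose host equals or ends with the given suffix. */
  void denyHost(String suffix) {
    denyAll.add(suffix);
  }

  /** Deny URLs under the suffix whose path (and query) matches the regex. */
  void denyPath(String suffix, String regex) {
    denyRules.computeIfAbsent(suffix, k -> new ArrayList<>())
        .add(Pattern.compile(regex));
  }

  /** Returns true if no rule selected by the host name rejects the URL. */
  boolean accept(String url) {
    URI uri = URI.create(url); // throws IllegalArgumentException if malformed
    String host = uri.getHost();
    if (host == null) {
      return false; // assumption of this sketch: no host, no accept
    }
    String target = uri.getRawPath() == null ? "" : uri.getRawPath();
    if (uri.getRawQuery() != null) {
      target += "?" + uri.getRawQuery();
    }
    // Step 1: exact lookups for each suffix at a label boundary, e.g.
    // www.example.com -> example.com -> com, then "." for global rules.
    String suffix = host.toLowerCase(Locale.ROOT);
    while (true) {
      if (rejects(suffix, target)) {
        return false;
      }
      int dot = suffix.indexOf('.');
      if (dot < 0) {
        break;
      }
      suffix = suffix.substring(dot + 1);
    }
    return !rejects(".", target);
  }

  // Step 2: apply only the regexes registered for this suffix.
  private boolean rejects(String suffix, String target) {
    if (denyAll.contains(suffix)) {
      return true; // "deny everything" shortcut, no regex evaluated
    }
    List<Pattern> rules = denyRules.get(suffix);
    if (rules == null) {
      return false;
    }
    for (Pattern p : rules) {
      if (p.matcher(target).find()) {
        return true;
      }
    }
    return false;
  }
}
```

With `denyPath("example.com", "^/private/")` registered, `accept("http://www.example.com/private/x")` is rejected while `accept("http://notexample.com/private/x")` is not, because suffixes are only taken at label boundaries. Each lookup is a hash map access, so the number of regexes evaluated depends on the rules registered for the matching suffixes rather than on the total length of the rule list.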



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
