[
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146262#comment-15146262
]
ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------
Github user thammegowda commented on a diff in the pull request:
https://github.com/apache/nutch/pull/89#discussion_r52835493
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if a URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, URLs will be checked against the configured sequence of regex rules.
+ *<p>
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath
+ * but can be overridden using the property <code>db.ignore.external.exemptions.file</code>
+ * in ./conf/nutch-*.xml.
+ *</p>
+ *
+ * The exemption rules are specified in a plain-text file where each line is a rule.
+ *
+ * When the URL matches a regex it is exempted from 'db.ignore.external...'<br/>
+ * <h3>Examples:</h3>
+ * <ol>
+ * <li>
+ * <b>Exempt URLs ending with .jpg, .png, or .gif</b>
+ * <br/><code>.*\.(jpg|JPG|png|PNG|gif|GIF)$</code><br/><br/>
+ * </li>
+ * </ol>
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+ public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+ private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+ private static ExemptionUrlFilter INSTANCE;
+
+ private List<Pattern> exemptions;
+ private Configuration conf;
+ private boolean enabled;
+
+ public static ExemptionUrlFilter getInstance() {
--- End diff --
:+1:
Agreed.
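In isolation, the regex-matching behavior the diff describes can be sketched as a small standalone class. The names below (ExemptionCheckSketch, isExempted) are illustrative only, not the actual Nutch API; the real plugin implements org.apache.nutch.net.URLExemptionFilter and loads its rules from the configured exemption file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical standalone sketch of the regex-based exemption check
// from the diff above; class and method names are illustrative.
public class ExemptionCheckSketch {

  private final List<Pattern> exemptions = new ArrayList<>();

  public ExemptionCheckSketch(List<String> regexRules) {
    // Compile each rule line once, up front.
    for (String rule : regexRules) {
      exemptions.add(Pattern.compile(rule));
    }
  }

  // True when the URL matches any rule, i.e. it should still be fetched
  // even though db.ignore.external.links=true.
  public boolean isExempted(String url) {
    for (Pattern p : exemptions) {
      if (p.matcher(url).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    ExemptionCheckSketch filter = new ExemptionCheckSketch(
        Arrays.asList(".*\\.(jpg|JPG|png|PNG|gif|GIF)$"));
    System.out.println(filter.isExempted("http://cdn.example.com/logo.png"));  // prints "true"
    System.out.println(filter.isExempted("http://cdn.example.com/page.html")); // prints "false"
  }
}
```

Note that `Matcher.matches()` requires the whole URL to match the rule, which is why the example rules anchor on the file extension with `$`.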
> Plugin to override db.ignore.external to exempt interesting external domain
> URLs
> --------------------------------------------------------------------------------
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Reporter: Thamme Gowda N
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler
> (db.ignore.external.links=true) to fetch static resources from external
> domains.
> The generalized version of this: This plugin should permit interesting URLs
> from external domains (by overriding db.ignore.external). The interesting
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
> When using Nutch to crawl images from a set of domains, the crawler needs
> to fetch all images which may be linked from CDNs and other domains. In this
> scenario, allowing all external links and then writing hundreds of regular
> expressions is not feasible for a large number of domains.
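For the image-crawling use case described above, an exemption rule file (db-ignore-external-exemptions.txt per the javadoc in the diff, one regex per line) might look like the following sketch; the jpeg/JPEG alternatives are an assumed extension of the example rule shown in the diff:

```
.*\.(jpg|JPG|jpeg|JPEG|png|PNG|gif|GIF)$
```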
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)