[ 
https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249803#comment-13249803
 ] 

Yangxiaolong commented on NUTCH-366:
------------------------------------

Hello, I'm also interested in this issue.

I have submitted a proposal for GSoC. Link:
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/yangxiaolong/1
 

My initial solution:

(1) The issue's reporter suggests making URLFilters and URLNormalizers
implement the same interface. I think we should instead make the two existing
interfaces (URLFilter and URLNormalizer) extend a common interface, and then
implement a class, URLManglingers, that uses a "chained mangling" pattern to
run the configured normalizers and filters. This way the existing code changes
as little as possible. For example:

Old code:

public interface URLNormalizer extends Configurable {
  /* Extension ID */
  public static final String X_POINT_ID = URLNormalizer.class.getName();

  /* Interface for URL normalization */
  public String normalize(String urlString, String scope)
      throws MalformedURLException;
}

New code:

public interface URLMangling extends Configurable {
  public String mangling(String[] args);
}

public interface URLNormalizer extends URLMangling {
  /* Extension ID */
  public static final String X_POINT_ID = URLNormalizer.class.getName();

  /* Interface for URL normalization */
  public String normalize(String urlString, String scope)
      throws MalformedURLException;
}
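
As a rough sketch of the "chained mangling" idea in (1), and not the actual
Nutch API: the names UrlMangler, SimpleFilter, SimpleNormalizer, ManglerChain
and ManglingDemo below are all hypothetical, and the two Simple* interfaces are
simplified stand-ins for URLFilter and URLNormalizer so that the example
compiles on its own. It assumes a filter signals rejection by returning null.

```java
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical common interface: returns the (possibly rewritten) URL,
// or null to reject it.
interface UrlMangler {
    String mangle(String url, String scope) throws MalformedURLException;
}

// Simplified stand-ins for the existing filter and normalizer interfaces.
interface SimpleFilter {
    String filter(String url); // null means "rejected"
}

interface SimpleNormalizer {
    String normalize(String url, String scope) throws MalformedURLException;
}

// Runs filters and normalizers in exactly the order they were added.
final class ManglerChain implements UrlMangler {
    private final List<UrlMangler> steps = new ArrayList<>();

    ManglerChain addFilter(SimpleFilter f) {
        steps.add((url, scope) -> f.filter(url)); // adapt filter to the common interface
        return this;
    }

    ManglerChain addNormalizer(SimpleNormalizer n) {
        steps.add(n::normalize); // same shape as the common interface
        return this;
    }

    @Override
    public String mangle(String url, String scope) throws MalformedURLException {
        String current = url;
        for (UrlMangler step : steps) {
            if (current == null) {
                return null; // rejected earlier: skip the remaining steps
            }
            current = step.mangle(current, scope);
        }
        return current;
    }
}

final class ManglingDemo {
    // A cheap protocol filter first, then a normalizer that strips the fragment.
    static ManglerChain build() {
        return new ManglerChain()
            .addFilter(u -> (u.startsWith("http://") || u.startsWith("https://")) ? u : null)
            .addNormalizer((u, scope) -> {
                int i = u.indexOf('#');
                return i < 0 ? u : u.substring(0, i);
            });
    }
}
```

Because the existing interfaces are only wrapped here, current filter and
normalizer plugins would not need to change; only the caller decides the order.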

 
 

 

(2) Use a property, "url.mangling.order". In my proposal this property does not
decide whether normalizers or filters run first; it decides whether the new
code path is used at all. For example:

Old code:

if (urlNormalizers) {
  try {
    url = normalizers.normalize(url, scope); // normalize the url
  } catch (Exception e) {
    LOG.warn("Skipping " + url + ":" + e);
    url = null;
  }
}
if (url != null && urlFiltering) {
  try {
    url = filters.filter(url); // filter the url
  } catch (Exception e) {
    LOG.warn("Skipping " + url + ":" + e);
    url = null;
  }
}

 

New code:

if (urlmangling == null) {
  if (urlNormalizers) {
    try {
      url = normalizers.normalize(url, scope); // normalize the url
    } catch (Exception e) {
      LOG.warn("Skipping " + url + ":" + e);
      url = null;
    }
  }
  if (url != null && urlFiltering) {
    try {
      url = filters.filter(url); // filter the url
    } catch (Exception e) {
      LOG.warn("Skipping " + url + ":" + e);
      url = null;
    }
  }
} else {
  ... // here filters and normalizers run in the order we define
}
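
One possible shape for the else branch, as a sketch only. It assumes that the
value of "url.mangling.order" is a whitespace-separated list of step names,
which is my assumption and not something the issue defines. The class
OrderedMangling, the step names "normalize" and "filter", and the use of
java.util.Properties in place of the Nutch configuration object are all
hypothetical. When the property is not set, the sketch falls back to the
current behaviour (normalizers first, then filters).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.function.UnaryOperator;

final class OrderedMangling {
    // Assumed legacy order: all normalizers first, then all filters.
    static final String LEGACY_ORDER = "normalize filter";

    // Hypothetical registry of named steps. Each step returns the URL,
    // a rewritten URL, or null to reject it.
    private final Map<String, UnaryOperator<String>> steps = new HashMap<>();

    void register(String name, UnaryOperator<String> step) {
        steps.put(name, step);
    }

    String process(String url, Properties conf) {
        String order = conf.getProperty("url.mangling.order");
        if (order == null || order.trim().isEmpty()) {
            order = LEGACY_ORDER; // property unset: keep the old behaviour
        }
        for (String name : order.trim().split("\\s+")) {
            if (url == null) {
                break; // rejected by an earlier step, nothing left to do
            }
            UnaryOperator<String> step = steps.get(name);
            if (step == null) {
                throw new IllegalArgumentException("Unknown mangling step: " + name);
            }
            url = step.apply(url);
        }
        return url;
    }

    public static void main(String[] args) {
        OrderedMangling m = new OrderedMangling();
        m.register("normalize", u -> u.toLowerCase(Locale.ROOT));
        m.register("filter", u -> u.startsWith("http") ? u : null);

        Properties conf = new Properties();
        // Run the cheap filter before the (possibly costly) normalizer.
        conf.setProperty("url.mangling.order", "filter normalize");
        System.out.println(m.process("http://Example.COM/", conf));
        System.out.println(m.process("ftp://Example.COM/", conf));
    }
}
```

With the filter listed first, a URL that the filter rejects never reaches the
normalizer, which is the saving the issue description asks for.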


                
> Merge URLFilters and URLNormalizers
> -----------------------------------
>
>                 Key: NUTCH-366
>                 URL: https://issues.apache.org/jira/browse/NUTCH-366
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>              Labels: gsoc2012
>
> Currently Nutch uses two subsystems related to url validation and 
> normalization:
> * URLFilter: this interface checks if URLs are valid for further processing. 
> Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or 
> removes unneeded URL components, or performs any other URL mangling as 
> necessary. Input URLs are changed, and are returned as result.
> However, various Nutch tools run filters and normalizers in pre-determined 
> order, i.e. normalizers first, and then filters. In some cases, where 
> normalizers are complex and running them is costly (e.g. numerous regex 
> rules, DNS lookups) it would make sense to run some of the filters first 
> (e.g. prefix-based filters that select only certain protocols, or 
> suffix-based filters that select only known "extensions"). This is currently 
> not possible - we always have to run normalizers, only to later throw away 
> urls because they failed to pass through filters.
> I would like to solicit comments on the following two solutions, and work on 
> implementation of one of them:
> 1) we could make URLFilters and URLNormalizers implement the same interface, 
> and basically make them interchangeable. This way users could configure their 
> order arbitrarily, even mixing filters and normalizers out of order. This is 
> more complicated, but gives much more flexibility - and NUTCH-365 already 
> provides sufficient framework to implement this, including the ability to 
> define different sequences for different steps in the workflow.
> 2) we could use a property "url.mangling.order" ;) to define whether 
> normalizers or filters should run first. This is simple to implement, but 
> provides only limited improvement - because either all filters or all 
> normalizers would run, they couldn't be mixed in arbitrary order.
> Any comments?


        
