[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237741#comment-13237741 ]

Apurv Verma commented on NUTCH-366:
-----------------------------------

I did some reading of the codebase and your comments; here is what I have come 
up with. Please correct me if I am wrong.

In the CrawlDbFilter class, please see line 87: here we first normalize and 
then filter.

{code}
    if (urlNormalizers) {
      try {
        url = normalizers.normalize(url, scope); // normalize the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
    }
    if (url != null && urlFiltering) {
      try {
        url = filters.filter(url); // filter the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
    }

{code}

My solution, a quick hack:

Put the URLFilter inside the URLNormalizer. After all, a URLFilter is also a 
kind of normalizer. The external world would only call normalize(); 
normalize() itself would first normalize the URL and then call filter() on 
the result. If this solution works, the code can later be cleaned up by 
restructuring the entire nutch.net package.
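The quick hack could be sketched roughly like this; note the interface shapes and class names below are illustrative only, not the actual Nutch APIs:

```java
// Minimal interfaces standing in for the real Nutch types (illustrative only).
interface UrlFilter {
  String filter(String url); // returns null to reject the URL
}

interface UrlNormalizer {
  String normalize(String url, String scope);
}

// A normalizer that wraps a filter: normalize() first normalizes,
// then filters, so callers only ever invoke normalize().
class FilteringNormalizer implements UrlNormalizer {
  private final UrlNormalizer normalizer;
  private final UrlFilter filter;

  FilteringNormalizer(UrlNormalizer normalizer, UrlFilter filter) {
    this.normalizer = normalizer;
    this.filter = filter;
  }

  @Override
  public String normalize(String url, String scope) {
    String normalized = normalizer.normalize(url, scope); // normalize first
    if (normalized == null) return null;                  // already rejected
    return filter.filter(normalized);                     // then filter
  }
}

public class FilteringNormalizerDemo {
  public static void main(String[] args) {
    UrlNormalizer lower = (url, scope) -> url.toLowerCase();
    UrlFilter httpOnly = url -> url.startsWith("http://") ? url : null;
    UrlNormalizer combined = new FilteringNormalizer(lower, httpOnly);

    System.out.println(combined.normalize("HTTP://Example.COM/A", "default")); // http://example.com/a
    System.out.println(combined.normalize("ftp://example.com/", "default"));   // null
  }
}
```

A null result means the URL was rejected by the embedded filter, which matches the existing null convention in CrawlDbFilter.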

Isn't the total time complexity the same, theoretically?

Is there anything I have missed, or any corrections?
                
> Merge URLFilters and URLNormalizers
> -----------------------------------
>
>                 Key: NUTCH-366
>                 URL: https://issues.apache.org/jira/browse/NUTCH-366
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>              Labels: gsoc2012
>
> Currently Nutch uses two subsystems related to url validation and 
> normalization:
> * URLFilter: this interface checks if URLs are valid for further processing. 
> Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or 
> removes unneeded URL components, or performs any other URL mangling as 
> necessary. Input URLs are changed, and are returned as result.
> However, various Nutch tools run filters and normalizers in pre-determined 
> order, i.e. normalizers first, and then filters. In some cases, where 
> normalizers are complex and running them is costly (e.g. numerous regex 
> rules, DNS lookups) it would make sense to run some of the filters first 
> (e.g. prefix-based filters that select only certain protocols, or 
> suffix-based filters that select only known "extensions"). This is currently 
> not possible - we always have to run normalizers, only to later throw away 
> urls because they failed to pass through filters.
> I would like to solicit comments on the following two solutions, and work on 
> implementation of one of them:
> 1) we could make URLFilters and URLNormalizers implement the same interface, 
> and basically make them interchangeable. This way users could configure their 
> order arbitrarily, even mixing filters and normalizers out of order. This is 
> more complicated, but gives much more flexibility - and NUTCH-365 already 
> provides sufficient framework to implement this, including the ability to 
> define different sequences for different steps in the workflow.
> 2) we could use a property "url.mangling.order" ;) to define whether 
> normalizers or filters should run first. This is simple to implement, but 
> provides only limited improvement - because either all filters or all 
> normalizers would run, they couldn't be mixed in arbitrary order.
> Any comments?
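For reference, solution 1 from the description could look roughly like this; the UrlProcessor interface and all class names here are hypothetical, with a cheap protocol filter placed before a stand-in for a costly normalizer to show the motivating win:

```java
import java.util.Arrays;
import java.util.List;

// One shared interface for both filters and normalizers (hypothetical).
// Returning null drops the URL; returning a string passes it on, possibly changed.
interface UrlProcessor {
  String process(String url, String scope);
}

// A cheap filter that can now run before any expensive normalizer.
class ProtocolFilter implements UrlProcessor {
  public String process(String url, String scope) {
    return url.startsWith("http://") ? url : null;
  }
}

// Stands in for a costly normalizer (e.g. many regex rules).
class LowercaseNormalizer implements UrlProcessor {
  public String process(String url, String scope) {
    return url.toLowerCase();
  }
}

public class UrlProcessorChain {
  // Runs processors in the user-configured order, stopping on rejection.
  static String run(List<UrlProcessor> chain, String url, String scope) {
    for (UrlProcessor p : chain) {
      if (url == null) return null; // rejected earlier in the chain
      url = p.process(url, scope);
    }
    return url;
  }

  public static void main(String[] args) {
    List<UrlProcessor> chain =
        Arrays.asList(new ProtocolFilter(), new LowercaseNormalizer());
    System.out.println(run(chain, "http://Example.COM/", "default")); // http://example.com/
    System.out.println(run(chain, "ftp://example.com/", "default"));  // null
  }
}
```

With a single interface, the order is just the order of the configured list, so filters and normalizers can be mixed arbitrarily.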

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
