[ 
https://issues.apache.org/jira/browse/NUTCH-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141783#comment-16141783
 ] 

Marcos Bori commented on NUTCH-2413:
------------------------------------

Hi [~wastl-nagel],

Thanks for clarifying. You are right: in my code proposal I was applying 
"parse.filter.urls" when the fetch result is a redirection.
However, I was also applying it where the filters run on the resulting 
outlinks: in class FetcherThread, method output(), the URL filters are applied 
to the outlinks produced by parsing:

// Process all outlinks, normalize, filter and deduplicate
List<Outlink> outlinkList = new ArrayList<>(outlinksToStore);
HashSet<String> outlinks = new HashSet<>(outlinksToStore);
for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
        String toUrl = links[i].getToUrl();

        toUrl = ParseOutputFormat.filterNormalize(url.toString(), toUrl,
                        origin, ignoreInternalLinks, ignoreExternalLinks,
                        ignoreExternalLinksMode, urlFilters,
                        urlExemptionFilters, normalizers);
        if (toUrl == null) {
                continue;
        }

        validCount++;
        links[i].setUrl(toUrl);
        outlinkList.add(links[i]);
        outlinks.add(toUrl);
}

To get equivalent behaviour whether we fetch and parse together or 
separately, and if I'm not wrong, at this point we should avoid executing 
the filters if "parse.filter.urls" is false (and the normalizers if 
"parse.normalize.urls" is false, as well).
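
To illustrate the idea, here is a minimal, self-contained sketch of the gating 
I have in mind. It is not the Nutch code: the class OutlinkGatingSketch, the 
method processOutlinks(), the Map used as configuration and the 
Predicate/UnaryOperator stand-ins for the URL filters and normalizers are all 
hypothetical, and I am assuming that both properties default to true and that 
normalization runs before filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical stand-in, not Nutch code: models the outlink loop of
// FetcherThread.output() with the two configuration flags honoured.
final class OutlinkGatingSketch {

    static List<String> processOutlinks(List<String> outlinks,
            Map<String, String> conf,
            Predicate<String> urlFilter,
            UnaryOperator<String> normalizer) {
        // Assumption: both flags are treated as true when unset.
        boolean filter = Boolean.parseBoolean(
                conf.getOrDefault("parse.filter.urls", "true"));
        boolean normalize = Boolean.parseBoolean(
                conf.getOrDefault("parse.normalize.urls", "true"));

        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (normalize) {
                // Assumed order: normalize first, then filter.
                toUrl = normalizer.apply(toUrl);
            }
            if (filter && !urlFilter.test(toUrl)) {
                continue; // rejected by the URL filter
            }
            kept.add(toUrl);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "HTTP://A.example/", "http://blocked.example/");
        Map<String, String> conf = new HashMap<>();
        conf.put("parse.filter.urls", "false");
        conf.put("parse.normalize.urls", "false");
        // With both flags off, the outlinks pass through untouched.
        System.out.println(processOutlinks(links, conf,
                u -> !u.contains("blocked"), u -> u.toLowerCase()));
    }
}
```

In the real code the same effect could presumably be obtained by passing null 
filters and normalizers to filterNormalize() when the flags are false, but that 
depends on how filterNormalize() handles null arguments, which I have not shown 
here.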

In fact, ParseOutputFormat applies the filters at two points:
        (1) when pstatus.getMinorCode() is ParseStatus.SUCCESS_REDIRECT, they 
are applied to the redirection URL
        (2) when the parse succeeds, the filters are applied to all outlinks
But case (2) is only executed if "fetcher.parse" is false (that is, 
when we are executing fetch and parse separately), because when "fetcher.parse" 
is true, the filtering is applied in FetcherThread::output(), as shown above.
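
The behaviour described above can be summarised as a small truth table. The 
sketch below only models my description of the issue; the class 
FilterSiteSketch, the enum Site and the "patched" flag are hypothetical names 
introduced for illustration and do not exist in Nutch.

```java
// Hypothetical model of where outlink filtering happens, per this comment.
final class FilterSiteSketch {

    enum Site { FETCHER_THREAD_OUTPUT, PARSE_OUTPUT_FORMAT }

    // Which component filters the outlinks for a given "fetcher.parse".
    static Site outlinkFilterSite(boolean fetcherParse) {
        return fetcherParse ? Site.FETCHER_THREAD_OUTPUT
                            : Site.PARSE_OUTPUT_FORMAT;
    }

    // Whether outlinks end up filtered. "patched" models the proposed
    // change: FetcherThread.output() honours "parse.filter.urls" too.
    static boolean outlinksFiltered(boolean fetcherParse,
            boolean parseFilterUrls, boolean patched) {
        if (outlinkFilterSite(fetcherParse) == Site.PARSE_OUTPUT_FORMAT) {
            return parseFilterUrls; // separate parse step honours the flag
        }
        // Fetch and parse together: flag ignored unless patched.
        return patched ? parseFilterUrls : true;
    }

    public static void main(String[] args) {
        // The reported case: fetcher.parse=true, parse.filter.urls=false.
        System.out.println(outlinksFiltered(true, false, false));
        System.out.println(outlinksFiltered(true, false, true));
    }
}
```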

I'm posting a new pull request with the corresponding modifications.

Am I right?
        

> When fetching and parsing together, parameter "parse.filter.urls" is ignored
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-2413
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2413
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, parser
>    Affects Versions: 1.13
>         Environment: Apache Nutch release 1.13.
>            Reporter: Marcos Bori
>             Fix For: 1.14
>
>
> Consider a situation where we want to:
> (1) Execute the fetch and parse together ("fetcher.parse" set to "true")
> (2) Avoid applying the URL filters when executing this phase.
> Condition (2) can be configured when parsing is executed as a separate 
> process by setting "parse.filter.urls" to "false".
> However, this setting ("parse.filter.urls") is ignored when we execute the 
> fetch and parse phases together. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
