[ 
https://issues.apache.org/jira/browse/NUTCH-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141853#comment-16141853
 ] 

ASF GitHub Bot commented on NUTCH-2413:
---------------------------------------

sebastian-nagel commented on a change in pull request #216: fix for NUTCH-2413 contributed by maborec
URL: https://github.com/apache/nutch/pull/216#discussion_r135302496
 
 

 ##########
 File path: src/java/org/apache/nutch/fetcher/FetcherThread.java
 ##########
 @@ -695,14 +696,23 @@ private ParseStatus output(Text key, CrawlDatum datum, Content content,
           int validCount = 0;
 
           // Process all outlinks, normalize, filter and deduplicate
+          
+          // NUTCH-2413 Apply filters or normalizers only if configured
+          URLFilters urlFiltersForOutlinks = null;
+          if (conf.getBoolean("parse.filter.urls", true))
+            urlFiltersForOutlinks = urlFilters;
+          URLNormalizers normalizersForOutlinks = null;
+          if (conf.getBoolean("parse.normalize.urls", true))
+            normalizersForOutlinks = normalizers;
+          
           List<Outlink> outlinkList = new ArrayList<>(outlinksToStore);
           HashSet<String> outlinks = new HashSet<>(outlinksToStore);
           for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
             String toUrl = links[i].getToUrl();
-
+                      
             toUrl = ParseOutputFormat.filterNormalize(url.toString(), toUrl,
                 origin, ignoreInternalLinks, ignoreExternalLinks, ignoreExternalLinksMode,
-                    urlFilters, urlExemptionFilters,  normalizers);
+                urlFiltersForOutlinks, urlExemptionFilters,  normalizersForOutlinks);
 
 Review comment:
   Please keep the existing indentation, or apply the [Eclipse code formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml).
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> When fetching and parsing together, parameter "parse.filter.urls" is ignored
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-2413
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2413
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, parser
>    Affects Versions: 1.13
>         Environment: Apache Nutch release 1.13.
>            Reporter: Marcos Bori
>             Fix For: 1.14
>
>
> Consider a situation in which we want to:
> (1) Execute fetch and parse together ("fetcher.parse" set to "true")
> (2) Avoid applying the URL filters during this phase.
> Condition (2) can be configured when parsing runs as a separate 
> process by setting "parse.filter.urls" to "false".
> However, this setting ("parse.filter.urls") is ignored when the 
> fetch and parse phases are executed together.
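
The behaviour the issue asks for can be illustrated with a minimal, self-contained sketch. It is not the Nutch code: the class OutlinkFilterSketch, the Map-based configuration, and the UnaryOperator filter are hypothetical stand-ins for Configuration and URLFilters. Only the property name "parse.filter.urls" and its default of true come from the patch. The sketch also assumes that a null filter means "skip filtering", which is how the patch appears to use the null arguments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class OutlinkFilterSketch {

  // Hypothetical stand-in for a getBoolean(name, defaultValue) lookup on a configuration object.
  static boolean getBoolean(Map<String, String> conf, String name, boolean defaultValue) {
    String value = conf.get(name);
    return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
  }

  // Hand back the filter only when "parse.filter.urls" is enabled (default: true).
  // A null result means "do not filter" (assumption, see the text above).
  static UnaryOperator<String> filterForOutlinks(Map<String, String> conf,
      UnaryOperator<String> urlFilter) {
    return getBoolean(conf, "parse.filter.urls", true) ? urlFilter : null;
  }

  // Apply the (possibly null) filter to each link; a filter rejects a URL by returning null.
  static List<String> processOutlinks(List<String> links, UnaryOperator<String> filter) {
    List<String> kept = new ArrayList<>();
    for (String link : links) {
      String result = (filter == null) ? link : filter.apply(link);
      if (result != null) {
        kept.add(result);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Illustrative filter: keep only https URLs.
    UnaryOperator<String> onlyHttps = u -> u.startsWith("https://") ? u : null;
    List<String> links = List.of("https://example.org/a", "http://example.org/b");

    // Property unset: the filter is applied, so only the https link is kept.
    System.out.println(processOutlinks(links, filterForOutlinks(Map.of(), onlyHttps)));

    // Property set to "false": the filter is skipped, so both links are kept.
    System.out.println(processOutlinks(links,
        filterForOutlinks(Map.of("parse.filter.urls", "false"), onlyHttps)));
  }
}
```

The same gating would apply to "parse.normalize.urls" and the normalizers. Resolving both settings once, outside the per-link loop, avoids repeating the configuration lookup for every outlink.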



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
