[ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783533#comment-17783533
 ] 

ASF GitHub Bot commented on NUTCH-3025:
---------------------------------------

sebastian-nagel commented on code in PR #796:
URL: https://github.com/apache/nutch/pull/796#discussion_r1384536930


##########
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##########
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
 
   private Configuration conf;
   public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+  public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+  public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+  

Review Comment:
   What about adding a third limit for path and query combined 
([URL.getFile()](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html#getFile()))?
   - If somebody defines two generous but reasonable limits (for example, 2048) for both path and query, the resulting URL may still get quite long and cause trouble.
   - Also, the HTTP GET request includes both path and query.
   - For many use cases it should be sufficient to set just this limit.
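
   A minimal sketch of how the three checks could fit together, to make the suggestion concrete. It is not the code in this PR: the class name, the method names, and the convention that a negative limit disables a check are all assumptions made for illustration. It relies only on `URL.getPath()`, `URL.getQuery()` and `URL.getFile()`, where `getFile()` returns the path followed by `?` and the query when a query is present.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, not part of FastURLFilter.
class UrlLengthCheck {

  // Returns true if the URL exceeds any enabled limit.
  // Assumption: a negative limit means "check disabled".
  static boolean exceedsLimits(URL url, int maxPath, int maxQuery,
      int maxPathQuery) {
    String path = url.getPath();
    String query = url.getQuery(); // null when the URL has no query
    String file = url.getFile();   // path, plus "?" + query if present

    if (maxPath >= 0 && path.length() > maxPath) {
      return true;
    }
    if (maxQuery >= 0 && query != null && query.length() > maxQuery) {
      return true;
    }
    return maxPathQuery >= 0 && file.length() > maxPathQuery;
  }

  // Convenience wrapper so callers need not handle the checked exception.
  static URL toUrl(String s) {
    try {
      return URI.create(s).toURL();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    // Path and query are 2000 characters each, so both pass a 2048 limit,
    // but path + "?" + query is 4001 characters long.
    String path = "/" + "a".repeat(1999);
    String query = "q=" + "b".repeat(1998);
    URL url = toUrl("https://example.org" + path + "?" + query);

    System.out.println(url.getFile().length());                // 4001
    System.out.println(exceedsLimits(url, 2048, 2048, -1));    // false
    System.out.println(exceedsLimits(url, 2048, 2048, 3000));  // true
  }
}
```

   With only the two separate limits the URL above passes; the combined limit is what rejects it. Setting just the combined limit and disabling the other two would cover the simple case.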





> urlfilter-fast to filter based on the length of the URL
> -------------------------------------------------------
>
>                 Key: NUTCH-3025
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3025
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Major
>             Fix For: 1.20
>
>
> There is currently no filter implementation to remove URLs based on their 
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead we could 
> implement it in _urlfilter-fast_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
