[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783533#comment-17783533 ]
ASF GitHub Bot commented on NUTCH-3025:
---------------------------------------

sebastian-nagel commented on code in PR #796:
URL: https://github.com/apache/nutch/pull/796#discussion_r1384536930


##########
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##########
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
   private Configuration conf;
   public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+  public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+  public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+

Review Comment:
   What about adding a third limit for path and query combined ([URL.getFile()](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html#getFile()))?
   - if somebody defined two generous but reasonable limits (for example, 2048) for both path and query, the resulting URL may still get quite long and cause trouble
   - also, the HTTP GET request includes both path and query
   - for many use cases it should be sufficient to just set this limit


> urlfilter-fast to filter based on the length of the URL
> -------------------------------------------------------
>
>                 Key: NUTCH-3025
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3025
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Major
>             Fix For: 1.20
>
>
> There currently is no filter implementation to remove URLs based on their
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead we could
> implement it in _urlfilter-fast_



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
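
The combined limit proposed in the review comment could look roughly like the sketch below. It is only an illustration, not the code from the PR: the class `UrlLengthLimits`, its constructor parameters, and the rule that a negative value disables a check are all assumptions made for this example. It returns null for a rejected URL, which is how Nutch URL filters signal rejection.

```java
import java.net.MalformedURLException;
import java.net.URL;

/**
 * Hypothetical helper (not part of Nutch) showing separate limits for
 * path and query plus a third limit for path and query combined.
 */
final class UrlLengthLimits {

  // Assumption of this sketch: a negative limit disables that check.
  private final int maxPathLength;
  private final int maxQueryLength;
  private final int maxPathQueryLength;

  UrlLengthLimits(int maxPathLength, int maxQueryLength, int maxPathQueryLength) {
    this.maxPathLength = maxPathLength;
    this.maxQueryLength = maxQueryLength;
    this.maxPathQueryLength = maxPathQueryLength;
  }

  /** Returns the URL unchanged if it passes all limits, otherwise null. */
  String filter(String urlString) {
    URL url;
    try {
      url = new URL(urlString);
    } catch (MalformedURLException e) {
      return null; // unparsable URLs are rejected in this sketch
    }

    String path = url.getPath();   // empty string if there is no path
    String query = url.getQuery(); // null if there is no query
    String file = url.getFile();   // path, plus "?" and query if a query exists

    if (maxPathLength >= 0 && path.length() > maxPathLength) {
      return null;
    }
    if (maxQueryLength >= 0 && query != null && query.length() > maxQueryLength) {
      return null;
    }
    // The combined check catches URLs whose parts pass individually
    // but which are still too long as a whole.
    if (maxPathQueryLength >= 0 && file.length() > maxPathQueryLength) {
      return null;
    }
    return urlString;
  }

  public static void main(String[] args) {
    UrlLengthLimits limits = new UrlLengthLimits(10, 10, 15);
    System.out.println(limits.filter("http://example.org/abc?x=1"));
    System.out.println(limits.filter("http://example.org/12345678?q=123456"));
  }
}
```

In the second URL of `main`, the path (9 characters) and the query (8 characters) each stay within their limit of 10, but `getFile()` yields 18 characters including the `?`, so the combined limit of 15 rejects it. Setting only the combined limit and disabling the other two covers the simple case mentioned in the last bullet of the review comment.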