[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783544#comment-17783544 ]
ASF GitHub Bot commented on NUTCH-3025:
---------------------------------------

jnioche commented on code in PR #796:
URL: https://github.com/apache/nutch/pull/796#discussion_r1384621727


##########
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##########
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
   private Configuration conf;
   public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+  public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+  public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+

Review Comment:
   I might keep things simple and just add a size limit on the whole URL regardless of its parts, similar to [what is done in StormCrawler](https://github.com/DigitalPebble/storm-crawler/blob/ef31e509139cccb2919c345ef343c4fcfb2f1ec5/core/src/main/java/com/digitalpebble/stormcrawler/filtering/basic/BasicURLFilter.java#L30C17-L30C26).


> urlfilter-fast to filter based on the length of the URL
> -------------------------------------------------------
>
>                 Key: NUTCH-3025
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3025
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Major
>             Fix For: 1.20
>
>
> There is currently no filter implementation to remove URLs based on their
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could
> implement it in _urlfilter-fast_.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
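
For illustration only, the whole-URL limit suggested in the review comment could look roughly like the sketch below. This is not the code from PR #796. The class name UrlLengthFilter and the convention that a negative limit disables the check are assumptions made here, and the class is kept standalone rather than wired into FastURLFilter or a Configuration property. It only mirrors the URLFilter contract of returning the URL to keep it and null to reject it.

```java
/**
 * Hypothetical sketch of a whole-URL length check.
 * Not taken from Nutch or from PR #796.
 */
final class UrlLengthFilter {

  /** Maximum accepted URL length in characters; negative disables the check (assumed convention). */
  private final int maxLength;

  UrlLengthFilter(int maxLength) {
    this.maxLength = maxLength;
  }

  /**
   * Returns the URL unchanged if it passes, or null if it is rejected,
   * following the return convention of Nutch's URLFilter.filter(String).
   */
  String filter(String url) {
    if (url == null) {
      return null;
    }
    // One cheap length comparison on the full string: no parsing into
    // path and query parts, and no regex evaluation.
    if (maxLength >= 0 && url.length() > maxLength) {
      return null;
    }
    return url;
  }
}
```

With a limit of 20, new UrlLengthFilter(20).filter("http://a.org/") returns the URL as is, while any URL longer than 20 characters yields null. The appeal of this variant is that it needs a single setting instead of separate path and query limits.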