[
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783544#comment-17783544
]
ASF GitHub Bot commented on NUTCH-3025:
---------------------------------------
jnioche commented on code in PR #796:
URL: https://github.com/apache/nutch/pull/796#discussion_r1384621727
##########
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##########
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
private Configuration conf;
public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+ public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+ public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+
Review Comment:
I might keep things simple and just add a size limit on the whole URL
regardless of its parts, similar to [what is done in
StormCrawler.](https://github.com/DigitalPebble/storm-crawler/blob/ef31e509139cccb2919c345ef343c4fcfb2f1ec5/core/src/main/java/com/digitalpebble/stormcrawler/filtering/basic/BasicURLFilter.java#L30C17-L30C26)
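To make the suggestion concrete, here is a minimal sketch of a whole-URL length check. It is not the code from the PR or from StormCrawler: the class name `MaxLengthUrlCheck` and the property key `urlfilter.fast.url.max.length` are hypothetical, and the only convention borrowed from Nutch is that a URL filter returns the URL when it is accepted and `null` when it is rejected.

```java
/**
 * Illustrative sketch only: rejects URLs longer than a configured limit,
 * without looking at the path or query separately.
 */
public class MaxLengthUrlCheck {

    // Hypothetical property name, chosen for illustration; not defined in the PR.
    public static final String URL_MAX_LENGTH_KEY = "urlfilter.fast.url.max.length";

    // A value <= 0 is treated here as "no limit" (an assumption of this sketch).
    private final int maxLength;

    public MaxLengthUrlCheck(int maxLength) {
        this.maxLength = maxLength;
    }

    /** Returns the URL unchanged if it is accepted, or null if it is too long. */
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        if (maxLength > 0 && url.length() > maxLength) {
            return null;
        }
        return url;
    }
}
```

A single integer comparison per URL keeps the check cheap, which is the point of doing it here rather than in the regex filter.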
> urlfilter-fast to filter based on the length of the URL
> -------------------------------------------------------
>
> Key: NUTCH-3025
> URL: https://issues.apache.org/jira/browse/NUTCH-3025
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Julien Nioche
> Priority: Major
> Fix For: 1.20
>
>
> There currently is no filter implementation to remove URLs based on their
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could
> implement it in _urlfilter-fast_.