[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783563#comment-17783563 ]
ASF GitHub Bot commented on NUTCH-3025:
---------------------------------------

jnioche commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1798221743

   Writing a test for this thing is an absolute pain. The way the filters are used for real is that their setConf method is called and the rules are loaded using _getConfResourceAsReader_, i.e. they are expected to be in the jar. The tests do not rely on that mechanism and instead instantiate the filter directly with a reader for its rules. This means that the conf is not used at all, so we can't use it to load the values for the length-based filters.

   I will add another constructor taking the reader + conf so that we can test based on the length.

> urlfilter-fast to filter based on the length of the URL
> -------------------------------------------------------
>
>                 Key: NUTCH-3025
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3025
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Major
>             Fix For: 1.20
>
>
> There is currently no filter implementation to remove URLs based on their
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead we could
> implement it in _urlfilter-fast_
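
For illustration only, here is a rough sketch of the "reader + conf" constructor idea from the comment above. It is not the code from PR #796: the class name, the property names, and the one-host-per-line rule format are all made up for this example, and `java.util.Properties` stands in for the Hadoop `Configuration` object so that the snippet runs on its own. The only convention borrowed from Nutch is that `filter` returns the URL when it is accepted and `null` when it is rejected.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

/** Hypothetical stand-in for a rule-file based URL filter; NOT the Nutch class. */
class LengthAwareUrlFilter {
  // Hypothetical property names, invented for this sketch.
  static final String MAX_LENGTH = "example.urlfilter.max.length";
  static final String MAX_PATH_LENGTH = "example.urlfilter.max.path.length";
  static final String MAX_QUERY_LENGTH = "example.urlfilter.max.query.length";

  private final Set<String> deniedHosts = new HashSet<>();
  private final int maxLength;      // -1 disables the check
  private final int maxPathLength;  // -1 disables the check
  private final int maxQueryLength; // -1 disables the check

  /** Reader-only constructor: no conf is available, so no length limits apply. */
  LengthAwareUrlFilter(Reader rules) throws IOException {
    this(rules, new Properties());
  }

  /** Reader + conf constructor: lets a test supply rules and length limits directly. */
  LengthAwareUrlFilter(Reader rules, Properties conf) throws IOException {
    maxLength = Integer.parseInt(conf.getProperty(MAX_LENGTH, "-1"));
    maxPathLength = Integer.parseInt(conf.getProperty(MAX_PATH_LENGTH, "-1"));
    maxQueryLength = Integer.parseInt(conf.getProperty(MAX_QUERY_LENGTH, "-1"));
    // Simplified rule format (an assumption): one denied host per line, '#' for comments.
    try (BufferedReader in = new BufferedReader(rules)) {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty() && !line.startsWith("#")) {
          deniedHosts.add(line);
        }
      }
    }
  }

  /** Returns the URL if it is accepted, or null if it is rejected. */
  String filter(String url) {
    if (maxLength >= 0 && url.length() > maxLength) {
      return null;
    }
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return null; // unparsable URLs are rejected in this sketch
    }
    String path = uri.getRawPath();
    if (maxPathLength >= 0 && path != null && path.length() > maxPathLength) {
      return null;
    }
    String query = uri.getRawQuery();
    if (maxQueryLength >= 0 && query != null && query.length() > maxQueryLength) {
      return null;
    }
    String host = uri.getHost();
    if (host != null && deniedHosts.contains(host)) {
      return null;
    }
    return url;
  }
}
```

With a constructor of this shape, a test could build the filter from a `StringReader` holding the rules plus a small conf object holding the limits, without needing the rules file on the classpath.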