[
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783563#comment-17783563
]
ASF GitHub Bot commented on NUTCH-3025:
---------------------------------------
jnioche commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1798221743
Writing a test for this is an absolute pain. In real use, the filters have
their setConf method called and load their rules with
_getConfResourceAsReader_, i.e. the rules are expected to be in the jar.
The tests do not rely on that mechanism; instead, they instantiate the filter
directly with a reader for its rules. As a result, the conf is not used at
all, so we can't use it to load the values for the length-based filters. I
will add another constructor that takes the reader and the conf so that we
can test the length-based filtering.
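
For illustration, the idea of a second constructor taking both the rules
reader and the configuration could look roughly like the sketch below. It is
not the urlfilter-fast code: the class LengthAwareFilter, the property name
in MAX_LENGTH_KEY and the "one denied substring per line" rule format are all
hypothetical, and java.util.Properties stands in for the Hadoop Configuration
so that the example runs on its own. It follows the convention of returning
the URL when it is accepted and null when it is rejected.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for a URL filter; NOT the actual urlfilter-fast class.
class LengthAwareFilter {
    // Hypothetical property name, chosen for this sketch only.
    static final String MAX_LENGTH_KEY = "example.urlfilter.max.length";

    private final List<String> deniedSubstrings = new ArrayList<>();
    private final int maxLength; // a negative value disables the length check

    // Rules-only constructor: no configuration, so no length limit applies.
    LengthAwareFilter(Reader rules) throws IOException {
        this(rules, new Properties());
    }

    // Rules + configuration: lets a test set the limit without a conf file in the jar.
    LengthAwareFilter(Reader rules, Properties conf) throws IOException {
        this.maxLength = Integer.parseInt(conf.getProperty(MAX_LENGTH_KEY, "-1"));
        try (BufferedReader br = new BufferedReader(rules)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                // Skip blank lines and comments; every other line is a denied substring.
                if (!line.isEmpty() && !line.startsWith("#")) {
                    deniedSubstrings.add(line);
                }
            }
        }
    }

    // Returns the URL if it is accepted, or null if it is rejected.
    String filter(String url) {
        if (maxLength >= 0 && url.length() > maxLength) {
            return null;
        }
        for (String denied : deniedSubstrings) {
            if (url.contains(denied)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.setProperty(MAX_LENGTH_KEY, "30");
        LengthAwareFilter f =
            new LengthAwareFilter(new StringReader("blocked.example\n"), conf);
        // 18 characters, within the limit: accepted.
        System.out.println(f.filter("http://ok.example/"));
        // 47 characters, over the limit: rejected.
        System.out.println(f.filter("http://ok.example/a/rather/long/path?with=query"));
    }
}
```

With this shape, a test can pass a StringReader for the rules and set the
limit programmatically, while the rules-only constructor keeps its behaviour.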
> urlfilter-fast to filter based on the length of the URL
> -------------------------------------------------------
>
> Key: NUTCH-3025
> URL: https://issues.apache.org/jira/browse/NUTCH-3025
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Julien Nioche
> Priority: Major
> Fix For: 1.20
>
>
> There currently is no filter implementation to remove URLs based on their
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could
> implement it in _urlfilter-fast_.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)