[ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783563#comment-17783563
 ] 

ASF GitHub Bot commented on NUTCH-3025:
---------------------------------------

jnioche commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1798221743

   Writing a test for this thing is an absolute pain. In production, the 
filters are configured by calling their setConf method, and the rules are 
loaded using _getConfResourceAsReader_, i.e. they are expected to be in the jar.
   The tests do not rely on that mechanism; instead, they instantiate the 
filter directly with a reader for its rules. This means that the conf is not 
used at all, so we can't use it to load the values for the length-based 
filters. I will add another constructor that takes the reader + conf so that 
we can test the length-based filtering.
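
   To illustrate the constructor idea, here is a minimal, self-contained 
sketch. It is not the Nutch code: the class name, the property key and the 
one-substring-per-line rule format are all made up for the example, and 
java.util.Properties stands in for the Hadoop Configuration. It only shows 
how a test could pass both the rules reader and the conf without going 
through setConf.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for a URL filter; not the actual urlfilter-fast class.
class LengthFilterSketch {
  // Hypothetical property key, chosen for this example only.
  static final String MAX_LENGTH_KEY = "example.url.max.length";

  private final List<String> deniedSubstrings = new ArrayList<>();
  private final int maxLength; // negative means "no length limit"

  // Reader-only constructor, as used by tests today: no conf, so no limit.
  LengthFilterSketch(Reader rules) {
    this(rules, new Properties());
  }

  // Reader + conf constructor: lets a test set the length limit directly.
  LengthFilterSketch(Reader rules, Properties conf) {
    this.maxLength = Integer.parseInt(conf.getProperty(MAX_LENGTH_KEY, "-1"));
    try (BufferedReader reader = new BufferedReader(rules)) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        // Simplified rule format: each non-comment line is a denied substring.
        if (!line.isEmpty() && !line.startsWith("#")) {
          deniedSubstrings.add(line);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns the URL if it is accepted, or null if it is filtered out.
  String filter(String url) {
    if (maxLength >= 0 && url.length() > maxLength) {
      return null;
    }
    for (String denied : deniedSubstrings) {
      if (url.contains(denied)) {
        return null;
      }
    }
    return url;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(MAX_LENGTH_KEY, "30");
    LengthFilterSketch filter =
        new LengthFilterSketch(new StringReader("# rules\n/private/\n"), conf);
    System.out.println(filter.filter("https://example.org/a"));
    System.out.println(filter.filter("https://example.org/" + "x".repeat(50)));
  }
}
```

   With the limit set to 30, the short URL is returned unchanged and the long 
one is rejected (null). The reader-only constructor keeps its old behaviour 
because it delegates with an empty conf.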




> urlfilter-fast to filter based on the length of the URL
> -------------------------------------------------------
>
>                 Key: NUTCH-3025
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3025
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Major
>             Fix For: 1.20
>
>
> There is currently no filter implementation to remove URLs based on their 
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could 
> implement it in _urlfilter-fast_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
