[ 
https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174819#comment-14174819
 ] 

Sebastian Nagel commented on NUTCH-1877:
----------------------------------------

From conf/suffix-urlfilter.txt.template:
{code}
# uncomment the line below to filter on url path
#+P
{code}

With {{+P}} enabled, MP3s are filtered properly. Maybe we should make this the 
default and add a proper description/warning. Matching on the full URL string 
can produce other strange results:
{noformat}
% cat test_NUTCH-1877.txt 
http://xyz.com
http://xyz.com/music.mp3
http://xyz.com/music.mp3?q=abc
http://xyz.com/search?q=foo.com

% cat conf/suffix-urlfilter.txt 
# mode is: ignore matched, allow unmatched URLs
+
# case-insensitive, allow unknown suffixes
+I
# filter only on URL path
#+P
# excluded suffixes
.com
.mp3

% cat test_NUTCH-1877.txt | bin/nutch org.apache.nutch.net.URLFilterChecker \
    -filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter
Checking combination of all URLFilters available
-http://xyz.com
-http://xyz.com/music.mp3
+http://xyz.com/music.mp3?q=abc
-http://xyz.com/search?q=foo.com
{noformat}

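A user would hardly intend to skip {{http://xyz.com}} and {{...?q=foo.com}}, 
yet both are rejected because the whole URL string ends in {{.com}}, while 
{{music.mp3?q=abc}} passes because the string ends in {{abc}}. With {{+P}} 
the behavior is more intuitive: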
{noformat}
+http://xyz.com
-http://xyz.com/music.mp3
-http://xyz.com/music.mp3?q=abc
+http://xyz.com/search?q=foo.com
{noformat}
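
To make the difference between the two modes concrete, here is a minimal Java sketch. It is not the SuffixURLFilter code: the class {{SuffixMatchSketch}} and its helper methods are hypothetical names made up for illustration, and it assumes that "filter on URL path" simply means testing the suffix against the path component as returned by {{java.net.URI}}.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; NOT the actual SuffixURLFilter implementation.
// All names here are hypothetical.
class SuffixMatchSketch {

    // True if the text ends with any listed suffix, ignoring case (like +I).
    static boolean endsWithAny(String text, List<String> suffixes) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String suffix : suffixes) {
            if (lower.endsWith(suffix.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Suffix test against the whole URL string (assumed behavior without +P).
    static boolean matchesFullUrl(String url, List<String> suffixes) {
        return endsWithAny(url, suffixes);
    }

    // Suffix test against the path component only (assumed behavior with +P).
    static boolean matchesPath(String url, List<String> suffixes) {
        // getPath() is "" for http://xyz.com and excludes the query string.
        String path = URI.create(url).getPath();
        return endsWithAny(path == null ? "" : path, suffixes);
    }

    public static void main(String[] args) {
        List<String> suffixes = List.of(".com", ".mp3");
        List<String> urls = List.of(
            "http://xyz.com",
            "http://xyz.com/music.mp3",
            "http://xyz.com/music.mp3?q=abc",
            "http://xyz.com/search?q=foo.com");
        for (String url : urls) {
            // "-" marks a URL whose suffix matched an excluded entry.
            String full = matchesFullUrl(url, suffixes) ? "-" : "+";
            String path = matchesPath(url, suffixes) ? "-" : "+";
            System.out.println("full URL: " + full + "  path only: " + path + "  " + url);
        }
    }
}
```

Under these assumptions the "full URL" column gives the same accept/reject markers as the first listing above, and the "path only" column the same as the listing with {{+P}}.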


> Suffix URL filter doesn't ignore query strings
> ----------------------------------------------
>
>                 Key: NUTCH-1877
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1877
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.10
>
>
> Suffix URL filter entry: .mp3
> Does not filter out: http://www.example.org/file.mp3?a=b



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
