[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174819#comment-14174819 ]
Sebastian Nagel commented on NUTCH-1877:
----------------------------------------
From conf/suffix-urlfilter.txt.template:
{code}
# uncomment the line below to filter on url path
#+P
{code}
With {{+P}}, MP3s are filtered properly. Maybe we should make this the default
and add a proper description/warning. Matching against the full URL string
(host and query included) may have other strange results:
{noformat}
% cat test_NUTCH-1877.txt
http://xyz.com
http://xyz.com/music.mp3
http://xyz.com/music.mp3?q=abc
http://xyz.com/search?q=foo.com
% cat conf/suffix-urlfilter.txt
# mode is: ignore matched, allow unmatched URLs
+
# case-insensitive, allow unknown suffixes
+I
# filter only on URL path
#+P
# excluded suffixes
.com
.mp3
% cat test_NUTCH-1877.txt | bin/nutch org.apache.nutch.net.URLFilterChecker \
    -filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter
Checking combination of all URLFilters available
-http://xyz.com
-http://xyz.com/music.mp3
+http://xyz.com/music.mp3?q=abc
-http://xyz.com/search?q=foo.com
{noformat}
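A user would hardly intend to skip {{xyz.com}} and {{...?q=foo.com}}: both are
rejected only because the host name or the query string happens to end in
{{.com}}. With {{+P}} only the URL path is matched, and the behavior is more
"intuitive":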
{noformat}
+http://xyz.com
-http://xyz.com/music.mp3
-http://xyz.com/music.mp3?q=abc
+http://xyz.com/search?q=foo.com
{noformat}
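To make the difference between the two modes easier to see, here is a minimal Python sketch of the matching logic. It is not the SuffixURLFilter code from Nutch: the names {{accepts}}, {{check}} and {{filter_on_path}} are made up for illustration, and it only assumes case-insensitive suffix matching against either the whole URL string or the URL path (the latter standing in for {{+P}}).

```python
from urllib.parse import urlsplit

# Excluded suffixes, as in the conf/suffix-urlfilter.txt shown above.
EXCLUDED_SUFFIXES = (".com", ".mp3")

URLS = [
    "http://xyz.com",
    "http://xyz.com/music.mp3",
    "http://xyz.com/music.mp3?q=abc",
    "http://xyz.com/search?q=foo.com",
]


def accepts(url, filter_on_path=False):
    """Return True if the URL passes, False if an excluded suffix matches.

    filter_on_path is a hypothetical stand-in for the +P option: when set,
    only the path component is compared, so host and query are ignored.
    """
    target = urlsplit(url).path if filter_on_path else url
    # Lower-casing the target mimics the case-insensitive (+I) setting.
    return not target.lower().endswith(EXCLUDED_SUFFIXES)


def check(urls, filter_on_path):
    """Prefix each URL with '+' (accepted) or '-' (rejected)."""
    return [("+" if accepts(u, filter_on_path) else "-") + u for u in urls]


if __name__ == "__main__":
    for on_path in (False, True):
        print("path only:" if on_path else "full URL string:")
        print("\n".join(check(URLS, on_path)))
```

Under these assumptions the sketch gives the same accept/reject pattern as the two listings above: on the full string, the host {{xyz.com}} and the query {{q=foo.com}} trigger the {{.com}} suffix while {{?q=abc}} hides {{.mp3}}; on the path alone, only the two {{music.mp3}} URLs are rejected.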
> Suffix URL filter doesn't ignore query strings
> ----------------------------------------------
>
> Key: NUTCH-1877
> URL: https://issues.apache.org/jira/browse/NUTCH-1877
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.9
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.10
>
>
> Suffix URL filter entry: .mp3
> Does not filter out: http://www.example.org/file.mp3?a=b