Has anyone gotten Nutch to fetch URLs that contain a special character, specifically '?'? I'm crawling a site and want to fetch some, but only some, of the URLs that contain question marks. For example, I want http://www.giantfood.com/corporate/company_press_display.htm?press_id=380, which is linked from http://www.giantfood.com/corporate/company_press.htm, but URLs like this always seem to get skipped. I tried removing '?' from the list of special characters to skip (see regex-urlfilter.txt below), and that does let Nutch grab some URLs with question marks (http://www.giantfood.com/ntlinktrack?...), but I don't want those. I'm fairly sure this has nothing to do with the links requiring JavaScript, since I can follow them in a browser with JavaScript disabled.

My regex-urlfilter.txt:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(js|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

-^http://www.giantfood.com/cgi-bin/*
-^http://www.giantfood.com/locator/store_dsp_detail*
-^http://www.giantfood.com/aplus/aplus_school_directory.htm*
-^http://www.giantfood.com/careers/company_employapp.htm?posname=*
-^http://www.giantfood.com/pharmacy/header.htm
-^http://www.giantfood.com/pharmacy/sidebar.htm
-^http://www.giantfood.com/wine/header.htm
-^http://www.giantfood.com/wine/sidebar.htm
-^http://www.giantfood.com/foodguide/*
-^http://www.stopandshop.com/cgi-bin/*
-^http://www.stopandshop.com/rxrefill/ss-top.htm
-^http://www.stopandshop.com/rxrefill/ss-left.htm
-^http://www.stopandshop.com/rxrefill/ss-frame.htm
-^http://www.stopandshop.com/rxrefill/blank.htm
-^http://www.stopandshop.com/great_ideas/meal_solutions/top.htm
-^http://www.stopandshop.com/great_ideas/gift_cards/top.htm
-^http://www.stopandshop.com/UPR_SSWWeb/*
-^http://www.giantfood.com/ntlinktrack?=*
-^http://www.stopandshop.com/ntlinktrack?=*

+^http://www.giantfood.com/
+^http://www.stopandshop.com/
+^http://www.stopandshop.com/payvantage/
+^http://www.giantfood.com/corporate/company_press_display.htm?press_id=*
-.
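
The header comment in the file says the first matching pattern decides a URL's fate, so both rule order and regex escaping matter here. The following is a minimal Python sketch of that first-match behaviour, not Nutch's actual implementation. It assumes patterns are applied with "find anywhere in the URL" semantics, that the garbled character-class line is the stock `-[?*!@=]`, and it uses a hypothetical `?id=1` query string for the ntlinktrack URL. The helper name `accepts` is also made up for illustration.

```python
import re

def accepts(rules, url):
    """First-match-wins filter: '+' accepts, '-' rejects, no match rejects."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

PRESS = "http://www.giantfood.com/corporate/company_press_display.htm?press_id=380"
INDEX = "http://www.giantfood.com/corporate/company_press.htm"
TRACK = "http://www.giantfood.com/ntlinktrack?id=1"  # hypothetical query string

# As written in the file: '?' is a quantifier here (it makes the 'm' optional),
# so the pattern never matches a literal question mark.
unescaped = r"^http://www.giantfood.com/corporate/company_press_display.htm?press_id=*"
# Escaped variant: '\?' matches a literal question mark.
escaped = r"^http://www\.giantfood\.com/corporate/company_press_display\.htm\?press_id="

print(bool(re.search(unescaped, PRESS)))  # False
print(bool(re.search(escaped, PRESS)))    # True

# Specific accept rule placed before the general character-class reject rule.
reordered = [
    ("+", escaped),
    ("-", r"[?*!@=]"),                      # assumed stock Nutch rule
    ("+", r"^http://www\.giantfood\.com/"),
    ("-", r"."),
]
for url in (PRESS, INDEX, TRACK):
    print(accepts(reordered, url), url)
```

In this sketch the press-release URL is accepted only when its `+` rule both escapes the `?` and appears before any rule that rejects query characters; the ntlinktrack URL is still rejected by the character-class rule.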
