I am using Nutch 0.7.2. I would like to crawl only a certain section of a website, namely pages such as http://domain.com/ID1124, http://domain.com/ID22351, http://domain.com/ID546, and so on.
I tried feeding in just this one line: http://domain.com/ID* (I added it to url.txt and fed that file to Nutch), but that didn't work. It would be difficult to generate a list of IDs from the website and feed that static list to Nutch. Does Nutch accept wildcards in the seed URLs? If so, how can I get it working? If not, are there any workarounds? My crawl filter works well: when I passed in just http://domain.com/ID546, I was able to retrieve that page. Thanks.
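To make the pattern I have in mind concrete, here it is written as a Java regular expression. This is only a sketch of what the wildcard is meant to match, assuming the IDs are purely numeric as in the examples above; the class and method names are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class IdUrlPattern {
    // Assumption: every wanted page is "ID" followed by one or more digits.
    private static final Pattern ID_URL =
            Pattern.compile("^http://domain\\.com/ID\\d+$");

    // Returns true if the whole URL has the ID<digits> shape.
    static boolean matches(String url) {
        return ID_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://domain.com/ID1124",
            "http://domain.com/ID546",
            "http://domain.com/about",
            "http://domain.com/ID*"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + matches(url));
        }
    }
}
```

Note that the literal string http://domain.com/ID* does not itself match this pattern, since the asterisk is not a digit.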
