It's not only confusing me, it's also confusing the author, FrankMcCown, of the Nutch tutorial:
http://wiki.apache.org/nutch/NutchTutorial

Crawl Command: Configuration

To configure things for the crawl command you must:

* Create a directory with a flat file of root URLs. For example, to crawl
  the Nutch site you might start with a file named urls/nutch containing
  the URL of just the Nutch home page. All other Nutch pages should be
  reachable from this page. The urls/nutch file would thus contain:

      http://lucene.apache.org/nutch/

* Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with
  the name of the domain you wish to crawl. For example, if you wished to
  limit the crawl to the apache.org domain, the line should read:

      +^http://([a-z0-9]*\.)*apache.org/

  This will include any URL in the domain apache.org.

* Until someone can explain this: when I use the file crawl-urlfilter.txt,
  the filter doesn't work. Instead of it, use the file
  conf/regex-urlfilter.txt and change the last line from "+." to "-.".

reinhard schwab wrote:
> I have tried the recrawl script of Susam Pal and wondered why URL
> filtering no longer works.
> http://wiki.apache.org/nutch/Crawl
>
> The mystery is:
>
> Only Crawl.java adds crawl-tool.xml to the NutchConfiguration:
>
>     Configuration conf = NutchConfiguration.create();
>     conf.addResource("crawl-tool.xml");
>
> Fetcher.java and all the other tools which filter the outlinks do not
> add this. This is really confusing me, and I have spent some time
> figuring it out.
>
> regards
> reinhard
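For what it's worth, the accept rule quoted above is an ordinary Java regular expression with a leading "+" meaning "accept". A minimal sketch of how such a rule matches URLs (plain java.util.regex, not Nutch's actual filter classes; the class and method names here are made up for illustration):

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // The regex part of the tutorial's rule: +^http://([a-z0-9]*\.)*apache.org/
    static final Pattern ACCEPT =
        Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");

    // An accept rule passes a URL when the anchored pattern matches its prefix.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://lucene.apache.org/nutch/")); // true
        System.out.println(accepts("http://example.com/"));             // false
    }
}
```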

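For reference, the workaround from the last bullet would make conf/regex-urlfilter.txt end roughly like this (a sketch, assuming the domain rule from the tutorial is added to that file; the final catch-all line decides the default):

```
# accept URLs inside apache.org
+^http://([a-z0-9]*\.)*apache.org/

# reject everything else (last line changed from "+." to "-.")
-.
```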