Hello Markus and thank you for the quick answer! I didn't see that URLFilters were referenced directly in IndexerMapReduce. That's really powerful! Thanks!
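
For the archives, here is a minimal Python sketch of how I understand a separate rules file at indexing time would keep the hub page out of the index. It only mimics the regex-urlfilter.txt conventions as I understand them (a leading + or - per rule, first matching rule wins, no match means the URL is rejected); it is not Nutch code. The example.com URLs and the rule set are hypothetical, and the exact property to override with -D should be checked in nutch-default as Markus says.

```python
import re

# Hypothetical indexing-stage rules: skip the hub page, accept everything else.
INDEX_RULES = r"""
# crawl this page for its outlinks, but keep it out of the index
-^http://example\.com/main\.html$
# accept anything else
+.
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL matching no rule is rejected


if __name__ == "__main__":
    rules = parse_rules(INDEX_RULES)
    for url in ("http://example.com/main.html",
                "http://example.com/docs/page1.html"):
        print(url, "->", "index" if accepts(rules, url) else "skip")
```

The fetch-stage file would stay as it is, so the hub page is still crawled and its outlinks are still followed; only the file used at indexing time carries the extra exclusion rule.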
2014-09-26 13:17 GMT+02:00 Markus Jelsma <[email protected]>:
> Hi - you can use different regex files at the indexing stage, see
> nutch-default for the configuration directive and use -Dparam=val to override
> the default regex-urlfilter.txt file at indexing stage.
> Markus
>
> -----Original message-----
>> From:Albinscode <[email protected]>
>> Sent: Friday 26th September 2014 11:25
>> To: [email protected]
>> Subject: Url post filtering
>>
>> Hello everybody,
>>
>> I'm used to filtering urls before the fetch operation by using regex-filter
>> to avoid crawling the world wide web.
>>
>> I've got a specific need: one main page giving all urls to crawl. I
>> want to crawl the main page to have outlinks but I don't want to index
>> this page. How can I proceed?
>>
>> I could enable this feature in my specific plugin but I want to be
>> sure nothing is already existing as ever ;)
>> Dirty solution would be to delete this main page url in the generated
>> solr index with a json query but yeah this is really dirty ;)
>>
>> Hope I'm clear.
>>

