Hi, I'm using Nutch to crawl different intranet sites. The idea is to use the crawl-urlfilter to tell the crawler to "stay" inside the seeded domain; I don't want it to follow links all around my intranet and crawl the same sites twice. Ideally this means I'd have to rewrite the nutch-site.xml each time I launch a crawl. In an automated production system this is not possible, so the solution would be to keep N different Nutch directories just for this. Do you have any suggestions?
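For what it's worth, one workaround I've seen is to generate a throwaway conf directory per crawl, rewrite only the URL filter in it, and point bin/nutch at that directory via the NUTCH_CONF_DIR environment variable, so the shared install is never touched. This is only a sketch: the helper name `make_crawl_conf`, the `/opt/nutch` default, and the `intranet.example.com` domain below are my own placeholders, and the exact filter file name (crawl-urlfilter.txt vs. regex-urlfilter.txt) depends on your Nutch version.

```shell
#!/bin/sh
# Hypothetical helper: build a per-crawl conf dir whose crawl-urlfilter.txt
# restricts the fetch to a single seeded domain. Assumes $NUTCH_HOME/conf
# holds the stock configuration ( /opt/nutch is just a placeholder default).
make_crawl_conf() {
  domain="$1"
  conf_template="${NUTCH_HOME:-/opt/nutch}/conf"
  crawl_conf="$(mktemp -d)/conf-$domain"
  cp -r "$conf_template" "$crawl_conf"

  # Escape the dots so the domain is matched literally in the regex.
  escaped=$(printf '%s' "$domain" | sed 's/\./\\./g')

  cat > "$crawl_conf/crawl-urlfilter.txt" <<EOF
# skip URLs with characters that usually indicate session ids / queries
-[?*!@=]
# accept only pages inside the seeded domain
+^http://([a-z0-9]*\.)*$escaped/
# reject everything else so the crawler stays put
-.
EOF

  echo "$crawl_conf"
}

# Usage sketch (not executed here): generate the conf dir for one site,
# then launch that crawl with it, e.g.
#   conf=$(make_crawl_conf intranet.example.com)
#   NUTCH_CONF_DIR="$conf" "$NUTCH_HOME/bin/nutch" crawl urls -dir "crawl-intranet" -depth 3
```

Since each crawl gets its own generated directory, N concurrent crawls need N tiny conf copies rather than N full Nutch installs.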
TIA /CM

--
Claudio Martella
Digital Technologies Unit Research & Development - Analyst
TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax +39 0471 068 129
[email protected]
http://www.tis.bz.it
