Hi,

I'm using Nutch to crawl different intranet sites. The idea is to use
the crawl-urlfilter to tell the crawler to "stay" inside the seeded
domain: I don't want it to follow links all around my intranet and
crawl the same sites twice. Ideally this means I'd have to rewrite
nutch-site.xml each time I launch a crawl. In an automated production
system this is not possible, so the solution would be to have N
different Nutch directories just because of this. Do you have any
suggestions?
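
For reference, the per-crawl filter I have in mind is something like the following crawl-urlfilter.txt sketch, where example.intranet stands in for whatever domain was seeded:

```
# Skip common binary/asset suffixes
-\.(gif|jpg|png|css|js|pdf)$

# Accept only URLs inside the seeded domain
# (example.intranet is a placeholder for the actual seed)
+^http://([a-z0-9-]+\.)*example\.intranet/

# Reject everything else
-.
```

The `+^http://...` line is the part that would have to change for every crawl, which is what makes a single shared Nutch configuration awkward.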

TIA


/CM

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
[email protected] http://www.tis.bz.it