Thank you for the tip, but I still can't solve my problem. Let me explain in more detail what I'm doing...
1. I created a file called 'urls.txt' and put one url in it (e.g. http://localhost/xxx/)
2. nutch admin db -create
3. nutch inject db urls.txt
4. nutch generate db segments
5. nutch fetch segments/<latest_segment>
6. nutch updatedb db segments/<latest_segment>

After repeating steps 4-6, say, 2-3 times and creating the index, I then run:

* nutch inject db new_urls.txt (new_urls.txt contains something like http://localhost/yyy/)
* nutch generate db segments
* nutch fetch segments/<latest_segment>

The fetcher still downloads urls from http://localhost/xxx/ (along with those from http://localhost/yyy/), even though there are no links between the two sites. I can understand why it behaves this way: I think the last 'generate' takes all outgoing links from the latest segment, doesn't it? But how can I force nutch to consider only outgoing links from the newly injected url? A regex-urlfilter won't solve my problem, since this is a very simple example and not a real production scenario...

Thank you in advance,
Ennio

On 1/24/06, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> If your "old urls" have not expired (30 days), then a bin/nutch generate
> will process only the new urls.
>
> Ennio Tosi wrote:
>
> >Hi, I created an index from an injected url. My problem is that if now
> >I inject another url in the webdb, the fetcher reprocesses the
> >starting url too... Is there a way to tell nutch to only process the
> >latest injected resource?
> >
> >Thanks,
> >Ennio
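For reference, here is the cycle from steps 4-6 as a small shell script. This is just a sketch: it assumes a Nutch 0.7-style layout where each 'generate' creates a new timestamped directory under segments/, and it picks the newest segment by sorting the directory names; adjust paths to your installation.

    #!/bin/sh
    # Repeat the generate/fetch/updatedb cycle a few times (steps 4-6 above).
    # Assumes each 'generate' creates a new timestamped segments/ directory.
    for i in 1 2 3; do
      bin/nutch generate db segments
      # grab the most recently created segment; timestamped names sort by date
      s=`ls -d segments/2* | tail -1`
      bin/nutch fetch $s
      bin/nutch updatedb db $s
    done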
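Also, to see what the webdb actually contains after those updatedb runs, and therefore what 'generate' has to choose from, the readdb tool should help. Something like the following, assuming the 0.7-era options (run bin/nutch readdb with no arguments to check the exact flags in your version):

    # summary counts of pages and links in the db
    bin/nutch readdb db -stats

    # dump every page record, including not-yet-fetched outlinks
    bin/nutch readdb db -dumppageurl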
