I solved my problem, and I'm posting the solution here in case someone else has the same issue. My mistake was the last execution of step 6 (updatedb), the one I ran just before injecting the new url. That operation fills the webdb with the new urls extracted from the previous iteration, so the next fetch instruction picks them up along with the latest injected url (http://localhost/yyy/).
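
A rough shell sketch of the corrected sequence, under a few assumptions: the nutch commands are copied as written in the steps quoted below, the NUTCH variable and the helper names latest_segment and crawl_rounds are made up for illustration, and segment directory names are assumed to sort chronologically. It runs as a dry run by default and only prints the commands.

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Set NUTCH to your real launcher (e.g. NUTCH=bin/nutch) to execute them.
NUTCH=${NUTCH:-echo nutch}

# Hypothetical helper: newest directory under segments/, assuming the names
# sort chronologically. Falls back to the placeholder used in the steps
# when no segment exists yet (as in a dry run).
latest_segment() {
    seg=$(ls -d segments/* 2>/dev/null | tail -n 1)
    echo "${seg:-segments/<latest_segment>}"
}

# Hypothetical helper: run $1 generate/fetch rounds, calling updatedb after
# every round except the last, so the links extracted from the final
# segment are not added to the webdb.
crawl_rounds() {
    rounds=$1
    i=1
    while [ "$i" -le "$rounds" ]; do
        $NUTCH generate db segments
        seg=$(latest_segment)
        $NUTCH fetch "$seg"
        if [ "$i" -lt "$rounds" ]; then
            $NUTCH updatedb db "$seg"
        fi
        i=$((i + 1))
    done
}

crawl_rounds 3                  # initial crawl, no updatedb after round 3
$NUTCH inject db new_urls.txt   # then add the new url
crawl_rounds 1                  # generate + fetch for the new url
```

The only point of the sketch is the skipped updatedb at the end of the initial crawl; the number of rounds is arbitrary.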
Hope this explanation is clear enough... :-P

On 1/24/06, Ennio Tosi <[EMAIL PROTECTED]> wrote:
> Thank you for the tip, I still can't solve my problem.
> Let me explain in more details what I'm doing...
>
> 1. I created a file called 'urls.txt'. Put one url in it (e.g.
> http://localhost/xxx/)
> 2. nutch admin db -create
> 3. nutch inject db urls.txt
> 4. nutch generate db segments
> 5. nutch fetch segments/<latest_segment>
> 6. nutch updatedb db segments/<latest_segment>
>
> After repeating for, say, 2-3 times steps 4-6 and creating the index I then
> run:
>
> * nutch inject db new_urls.txt (new_urls.txt contains something like
> http://localhost/yyy/)
> * nutch generate db segments
> * nutch fetch segments/<latest_segment>
>
> The fetcher still downloads urls from http://localhost/xxx/ (along
> with those from http://localhost/yyy/), even if there are no links
> between the two sites.
>
> I can understand why it is behaving this way: I think the last
> 'generate' instruction takes all outgoing links from the latest
> segment, isn't it?
> But how can I 'force' nutch to consider only outgoing links from the
> newly injected url?
> A regex-urlfilter won't solve my problem, since this is a very simple
> example and not a real production scenario...
>
> Thank you in advance,
> Ennio
>
> On 1/24/06, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> > If your "old urls" have not expired(30 day) then a bin/nutch generate
> > will process only the new urls.
> >
> > Ennio Tosi wrote:
> >
> > >Hi, I created an index from an injected url. My problem is that if now
> > >I inject another url in the webdb, the fetcher reprocesses the
> > >starting url too... Is there a way to tell nutch to only process the
> > >latest injected resource?
> > >
> > >Thanks,
> > >Ennio
