Thank you for the tip, but I still can't solve my problem.
Let me explain in more detail what I'm doing...

1. I created a file called 'urls.txt' and put one URL in it (e.g.
http://localhost/xxx/)
2. nutch admin db -create
3. nutch inject db urls.txt
4. nutch generate db segments
5. nutch fetch segments/<latest_segment>
6. nutch updatedb db segments/<latest_segment>
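
In shell terms, one whole cycle (steps 4-6) looks roughly like this;
the ls/tail line is just my way of grabbing the newest segment, and the
segments/2* pattern is only an assumption that holds because my segment
directories happen to be named by timestamp:

  bin/nutch generate db segments
  # pick the most recent segment dir (assumes timestamp-named dirs like 20060124...)
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s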

After repeating steps 4-6, say, 2-3 times and creating the index, I then run:

* nutch inject db new_urls.txt (new_urls.txt contains something like
http://localhost/yyy/)
* nutch generate db segments
* nutch fetch segments/<latest_segment>

The fetcher still downloads URLs from http://localhost/xxx/ (along
with those from http://localhost/yyy/), even though there are no links
between the two sites.

I can understand why it behaves this way: I think the last
'generate' command takes all outgoing links from the latest
segment, doesn't it?
But how can I 'force' Nutch to consider only the outgoing links from
the newly injected URL?
A regex-urlfilter won't solve my problem: it might work in this very
simple example, but not in a real production scenario...
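
(To make the point concrete: the filter approach would mean hand-editing
conf/regex-urlfilter.txt for every newly injected site, with entries
like the ones below, assuming the stock first-match-wins +/- syntax:

  # keep only the newly injected site, drop everything else
  +^http://localhost/yyy/
  -.

and that clearly doesn't scale past a toy example.)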

Thank you in advance,
Ennio

On 1/24/06, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> If your "old urls" have not expired (30 days), then a bin/nutch generate
> will process only the new urls.
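>
> (If I remember correctly, that 30-day figure is the
> db.default.fetch.interval property from nutch-default.xml, measured in
> days; an override in conf/nutch-site.xml would look something like this,
> though the property name here is from memory:
>
>   <!-- assumption: property name recalled from memory, default 30 days -->
>   <property>
>     <name>db.default.fetch.interval</name>
>     <value>30</value>
>   </property>
> )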
>
>
>
> Ennio Tosi wrote:
>
> >Hi, I created an index from an injected url. My problem is that if now
> >I inject another url in the webdb, the fetcher reprocesses the
> >starting url too... Is there a way to tell nutch to only process the
> >latest injected resource?
> >
> >Thanks,
> >Ennio
