Intranet crawling works like this: first, you create a webdb using
WebDBAdminTool. After that, you inject a seed URL using WebDBInjector.
The seed URL is inserted into your webdb, stamped with the current date and
time. Then you create a fetchlist using FetchListTool. FetchListTool reads
all URLs in the webdb that are due for crawling and puts them into the
fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once
crawling is finished, UpdateDatabaseTool extracts all outlinks and puts them
into the webdb. Newly extracted outlinks get the current date and time, while
the date and time of all just-crawled URLs are set 30 days ahead (this
actually happens in FetchListTool). So in the next round all extracted links
will be crawled, but not the just-crawled URLs. And so on.
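
To make the cycle above concrete, here is a minimal sketch in Python. It is
not Nutch code: the dict `webdb` and the functions `inject`,
`generate_fetchlist` and `update_db` are hypothetical stand-ins for
WebDBInjector, FetchListTool and UpdateDatabaseTool, the URLs are made up,
and the 30-day interval is just the default as I understand it.

```python
from datetime import datetime, timedelta

FETCH_INTERVAL = timedelta(days=30)  # assumed default recrawl interval


def inject(webdb, url, now):
    """Stand-in for WebDBInjector: add a seed URL, due immediately."""
    webdb.setdefault(url, now)


def generate_fetchlist(webdb, now):
    """Stand-in for FetchListTool: pick due URLs and push their next fetch out."""
    due = sorted(url for url, next_fetch in webdb.items() if next_fetch <= now)
    for url in due:
        webdb[url] = now + FETCH_INTERVAL
    return due


def update_db(webdb, outlinks, now):
    """Stand-in for UpdateDatabaseTool: add new outlinks, due immediately."""
    for url in outlinks:
        webdb.setdefault(url, now)  # already-known URLs keep their date


# Toy link graph used instead of a real Fetcher (made-up URLs).
LINKS = {
    "http://intranet/": ["http://intranet/a", "http://intranet/b"],
    "http://intranet/a": ["http://intranet/"],
    "http://intranet/b": [],
}

webdb = {}
start = datetime(2005, 12, 20)
inject(webdb, "http://intranet/", start)

# Round 1: only the seed is due; its outlinks are added to the webdb.
round1 = generate_fetchlist(webdb, start)
for url in round1:
    update_db(webdb, LINKS.get(url, []), start)

# Round 2: the new outlinks are due, the seed is not.
round2 = generate_fetchlist(webdb, start)
for url in round2:
    update_db(webdb, LINKS.get(url, []), start)

# Round 3: nothing is due until the interval has passed.
round3 = generate_fetchlist(webdb, start)
print(round1, round2, round3)
```

In this toy model the third round comes back empty, which is the point where
a real crawl would stop until the 30 days have passed.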

  Therefore, if the crawler is still running after 30 days (or whatever
threshold you set), all "just-crawled" URLs will be picked up again for
recrawling. That's why we need to keep the crawler alive, or restart it, at
that time. This could be done with a cron job, I think.
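
  As a rough illustration of the cron idea, here is a small Python sketch of
the check such a job could run on each invocation. Everything in it is an
assumption for illustration: `urls_due`, `run_recrawl_if_due` and the
`run_cycle` callback are hypothetical names, and the dict of next-fetch times
stands in for whatever the webdb actually stores.

```python
from datetime import datetime, timedelta


def urls_due(next_fetch_times, now):
    """Return the URLs whose next-fetch time has been reached."""
    return sorted(url for url, when in next_fetch_times.items() if when <= now)


def run_recrawl_if_due(next_fetch_times, now, run_cycle):
    """Call run_cycle(due_urls) only when something is due; return the due URLs.

    run_cycle is a placeholder for the real work (generate a fetchlist,
    fetch, update the db); it is passed in so the check stays testable.
    """
    due = urls_due(next_fetch_times, now)
    if due:
        run_cycle(due)
    return due


# Example: one URL crawled on 2005-12-20, so it is due again 30 days later.
crawled = datetime(2005, 12, 20)
times = {"http://intranet/": crawled + timedelta(days=30)}
calls = []

too_early = run_recrawl_if_due(times, crawled + timedelta(days=29), calls.append)
on_time = run_recrawl_if_due(times, crawled + timedelta(days=30), calls.append)
print(too_early, on_time)
```

A daily crontab entry pointing at a script like this (the path and schedule
would be your own) is enough, since the check does nothing when no URL is due.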

  Regards,
   Giang



On 12/20/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
>
> Hi Nguyen,
>
> Thank you for your information, but I would like to confirm that. I do see a
> variable that defines the next fetch interval, but I am not sure about it. If
> anyone has more information in this regard, please let me know.
>
> Thank you in advance,
>
>
>
>
> On 12/19/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
> >
> > As I understand, by default, all links in Nutch are recrawled after 30
> > days, as long as your Nutch process is still running. FetchListTool takes
> > care of this setting. So maybe you can write a script (and put it in cron?)
> > to reactivate the crawler.
> >
> > Regards,
> >   Giang
> >
> >
> > On 12/19/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi everyone,
> > >
> > > I have browsed through the Nutch documentation but I have not found enough
> > > information on how to recrawl the URLs that I have already crawled. Do we
> > > have to do the recrawling ourselves, or will the Nutch application do it?
> > >
> > > More information in this regard will be highly appreciated. Thank you very
> > > much.
> > >
> > > --
> > > Keep on smiling :) Kumar
> > >
> > >
> >
> >
>
>
> --
> Keep on smiling :) Kumar
>
>
