Hi Giang,

But if I want to run the CrawlTool manually, say once an hour, it throws an error like "Crawl directory already exists". If I comment out that statement, I get a number of errors like "Directory already exists". What should I do? Please show me a way out.
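From the scheme Giang describes below, my guess is that the fix is to run the one-time setup only once and to repeat only the fetch cycle, so nothing ever tries to recreate an existing directory. Here is a rough sketch of what I have in mind — assuming the 0.7-era bin/nutch aliases for the tools below (admin, inject, generate, fetch, updatedb) and hypothetical db/, segments/ and urls.txt paths:

    # One-time setup: create the webdb and inject the seed URLs.
    # Re-running "admin -create" against an existing db is what raises
    # the "directory already exists" error, so this part runs only once.
    bin/nutch admin db -create
    bin/nutch inject db -urlfile urls.txt

    # Recrawl cycle: safe to repeat every hour, because each pass writes
    # a new timestamped segment instead of recreating the crawl directory.
    bin/nutch generate db segments
    segment=`ls -d segments/* | tail -1`   # the segment just generated
    bin/nutch fetch $segment
    bin/nutch updatedb db $segment

Does that look right?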
On 12/20/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
>
> The scheme of intranet crawling is like this: first, you create a webdb
> using WebDBAdminTool. After that, you inject a seed URL using
> WebDBInjector. The seed URL is inserted into your webdb, marked with the
> current date and time. Then you create a fetchlist using FetchListTool.
> The FetchListTool reads all URLs in the webdb that are due to be crawled
> and puts them into the fetchlist. Next, the Fetcher crawls all URLs in
> the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool
> extracts all outlinks and puts them into the webdb. Newly extracted
> outlinks get the current date and time, while the just-crawled URLs have
> their date and time set 30 days ahead (this actually happens in
> FetchListTool). So the next time around, all newly extracted links will
> be crawled, but not the just-crawled URLs. And so on and so forth.
>
> Therefore, as long as the crawler is still alive after 30 days (or
> whatever threshold you set), all "just-crawled" URLs will be taken out
> and recrawled. That's why we need to maintain a live crawler at that
> time. This could be done using a cron job, I think.
>
> Regards,
> Giang
>
>
> On 12/20/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
> >
> > Hi Nguyen,
> >
> > Thank you for the information, but I would like to confirm it. I do
> > see a variable that defines the next fetch interval, but I am not sure
> > about it. If anyone has more information in this regard, please let me
> > know.
> >
> > Thank you in advance.
> >
> >
> > On 12/19/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
> > >
> > > As I understand it, by default all links in Nutch are recrawled
> > > after 30 days, as long as your Nutch process is still running.
> > > FetchListTool takes care of this setting. So maybe you can write a
> > > script (and put it in cron?) to reactivate the crawler.
> > >
> > > Regards,
> > > Giang
> > >
> > >
> > > On 12/19/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > I have browsed through the Nutch documentation, but I have not
> > > > found enough information on how to recrawl the URLs that I have
> > > > already crawled. Do we have to do the recrawling ourselves, or
> > > > will the Nutch application do it?
> > > >
> > > > More information in this regard would be highly appreciated.
> > > > Thank you very much.
> > > >
> > > > --
> > > > Keep on smiling :) Kumar
> >
> > --
> > Keep on smiling :) Kumar
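PS: As far as I can tell, the "variable that defines the next fetch interval" Kumar mentioned is the db.default.fetch.interval property (30 days by default) in conf/nutch-default.xml, which can be overridden in conf/nutch-site.xml — though please correct me if that is the wrong property:

    <property>
      <name>db.default.fetch.interval</name>
      <!-- default number of days between re-fetches of a page -->
      <value>30</value>
    </property>

And for the cron suggestion, assuming the recrawl cycle above is saved as a hypothetical recrawl.sh, an hourly crontab entry might look like:

    # rerun the fetch cycle at minute 0 of every hour
    0 * * * * /path/to/nutch/recrawl.sh >> /tmp/nutch-recrawl.log 2>&1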
