Hi Giang,

But if I want to run the CrawlTool manually, say once an hour, it throws an error like "Crawl directory already exists". If I comment out that statement, I get a number of errors like "Directory already exists". What should I do? Please show me a way out.
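From the scheme Giang describes below, my guess is that the fix is to run the one-time setup only once and to repeat only the fetch cycle, so nothing ever tries to recreate an existing directory. Here is a rough sketch of what I have in mind — assuming the 0.7-era bin/nutch aliases for the tools below (admin, inject, generate, fetch, updatedb) and hypothetical db/, segments/ and urls.txt paths:

    # One-time setup: create the webdb and inject the seed URLs.
    # Re-running "admin -create" against an existing db is what raises
    # the "directory already exists" error, so this part runs only once.
    bin/nutch admin db -create
    bin/nutch inject db -urlfile urls.txt

    # Recrawl cycle: safe to repeat every hour, because each pass writes
    # a new timestamped segment instead of recreating the crawl directory.
    bin/nutch generate db segments
    segment=`ls -d segments/* | tail -1`   # the segment just generated
    bin/nutch fetch $segment
    bin/nutch updatedb db $segment

Does that look right?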
On 12/20/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
>
> The scheme of intranet crawling is like this: first, you create a webdb
> using WebDBAdminTool. After that, you inject a seed URL using
> WebDBInjector. The seed URL is inserted into your webdb, marked with the
> current date and time. Then you create a fetchlist using FetchListTool.
> The FetchListTool reads all URLs in the webdb that are due to be crawled
> and puts them into the fetchlist. Next, the Fetcher crawls all URLs in
> the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool
> extracts all outlinks and puts them into the webdb. Newly extracted
> outlinks get the current date and time, while the just-crawled URLs have
> their date and time set 30 days ahead (this actually happens in
> FetchListTool). So the next time around, all newly extracted links will
> be crawled, but not the just-crawled URLs. And so on and so forth.
>
> Therefore, as long as the crawler is still alive after 30 days (or
> whatever threshold you set), all "just-crawled" URLs will be taken out
> and recrawled. That's why we need to maintain a live crawler at that
> time. This could be done using a cron job, I think.
>
> Regards,
> Giang
>
>
> On 12/20/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
> >
> > Hi Nguyen,
> >
> > Thank you for the information, but I would like to confirm it. I do
> > see a variable that defines the next fetch interval, but I am not sure
> > about it. If anyone has more information in this regard, please let me
> > know.
> >
> > Thank you in advance.
> >
> >
> > On 12/19/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
> > >
> > > As I understand it, by default all links in Nutch are recrawled
> > > after 30 days, as long as your Nutch process is still running.
> > > FetchListTool takes care of this setting. So maybe you can write a
> > > script (and put it in cron?) to reactivate the crawler.
> > >
> > > Regards,
> > > Giang
> > >
> > >
> > > On 12/19/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > I have browsed through the Nutch documentation, but I have not
> > > > found enough information on how to recrawl the URLs that I have
> > > > already crawled. Do we have to do the recrawling ourselves, or
> > > > will the Nutch application do it?
> > > >
> > > > More information in this regard would be highly appreciated.
> > > > Thank you very much.
> > > >
> > > > --
> > > > Keep on smiling :) Kumar
> >
> > --
> > Keep on smiling :) Kumar
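PS: As far as I can tell, the "variable that defines the next fetch interval" Kumar mentioned is the db.default.fetch.interval property (30 days by default) in conf/nutch-default.xml, which can be overridden in conf/nutch-site.xml — though please correct me if that is the wrong property:

    <property>
      <name>db.default.fetch.interval</name>
      <!-- default number of days between re-fetches of a page -->
      <value>30</value>
    </property>

And for the cron suggestion, assuming the recrawl cycle above is saved as a hypothetical recrawl.sh, an hourly crontab entry might look like:

    # rerun the fetch cycle at minute 0 of every hour
    0 * * * * /path/to/nutch/recrawl.sh >> /tmp/nutch-recrawl.log 2>&1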
