Do the steps manually as described here:

http://wiki.apache.org/nutch/SimpleMapReduceTutorial
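
Roughly: instead of the all-in-one CrawlTool, you run each phase yourself,
so nothing tries to recreate an existing crawl directory. A sketch, assuming
the mapred-era commands from that tutorial (directory names are illustrative):

  # one-time setup: inject the seed URLs into the crawl db
  bin/nutch inject crawl/crawldb urls
  # each cycle: generate a fetchlist, fetch it, update the db
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s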




On 21.12.2005, at 13:01, Arun Kaundal wrote:

Hi Giang
But if I want to run CrawlTool manually, say once an hour, it throws an
error like "Crawl directory already exists". If I comment out that statement,
I get a number of errors like "Directory already exists". What should I do?
   Please show me a way out...


On 12/20/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:

The scheme of intranet crawling is like this: First, you create a webdb
using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector.
The seed URL is inserted into your webdb, marked with the current date and
time. Then, you create a fetchlist using FetchListTool. FetchListTool reads
all URLs in the webdb that are due to be crawled and puts them into the
fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once
crawling is finished, UpdateDatabaseTool extracts all outlinks and puts them
into the webdb. Newly extracted outlinks get their date and time set to the
current date and time, while the just-crawled URLs have theirs pushed 30 days
into the future (this actually happens in FetchListTool). So on the next pass
the newly extracted links will be crawled, but not the just-crawled URLs, and
so on.
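
In command-line terms the cycle looks roughly like this (a sketch, assuming
the 0.7-era bin/nutch aliases for those tools; flags may differ in your
version, and seeds.txt is just an example file of URLs):

  # create the webdb (WebDBAdminTool)
  bin/nutch admin db -create
  # inject the seed URLs (WebDBInjector)
  bin/nutch inject db -urlfile seeds.txt
  # write a fetchlist into a new segment (FetchListTool)
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`
  # crawl everything in the fetchlist (Fetcher)
  bin/nutch fetch $s
  # extract outlinks and fold them back into the webdb (UpdateDatabaseTool)
  bin/nutch updatedb db $s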

Therefore, as long as the crawler is still alive after 30 days (or whatever
threshold you set), all "just-crawled" URLs will be taken out and recrawled.
That's why we need to keep a crawler running at that time. This could be done
using a cron job, I think.
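
For example, a crontab entry like this (recrawl.sh is a hypothetical wrapper
around the generate/fetch/updatedb cycle above; paths are illustrative):

  # run the recrawl cycle every night at 3am
  0 3 * * * /home/nutch/recrawl.sh >> /home/nutch/recrawl.log 2>&1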

Regards,
  Giang



On 12/20/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:

Hi Nguyen,

Thank you for your information, but I would like to confirm it. I do see a
variable that defines the next fetch interval, but I am not sure about it. If
anyone has more information in this regard, please let me know.

Thank you in advance,




On 12/19/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:

As I understand it, by default all links in Nutch are recrawled after 30
days, as long as your Nutch process is still running. FetchListTool takes
care of this setting. So maybe you can write a script (and put it in cron?)
to reactivate the crawler.
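
If you want to change the interval, the setting is presumably the fetch
interval property (assuming the 0.7-era property name; check your
nutch-default.xml), which you can override in conf/nutch-site.xml:

  <property>
    <name>db.default.fetch.interval</name>
    <!-- days before a page is due to be refetched; the default is 30 -->
    <value>30</value>
  </property>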

Regards,
  Giang


On 12/19/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:

Hi everyone,

I have browsed through the Nutch documentation but I have not found enough
information on how to recrawl the URLs that I have already crawled. Do we
have to do the recrawling ourselves, or will the Nutch application do it?

More information in this regard would be highly appreciated. Thank you very
much.

--
Keep on smiling :) Kumar






--
Keep on smiling :) Kumar






