Do the steps manually as described here:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial
On 21.12.2005, at 13:01, Arun Kaundal wrote:
Hi Giang
But if I want to run the CrawlTool manually, say every hour, it throws an error like "Crawl directory already exists". If I comment out that check, I get a number of errors like "Directory already exists". What should I do? Please show me a way out...
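(A sketch of one workaround, assuming the 0.7-era crawl command's -dir and -depth flags: give each run a fresh, timestamped directory so CrawlTool never sees an existing one.)

  # hypothetical hourly wrapper: fresh crawl directory per run
  bin/nutch crawl urls.txt -dir crawl-$(date +%Y%m%d%H%M) -depth 3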
On 12/20/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
The scheme of intranet crawling is like this: first, you create a webdb using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector. The seed URL is inserted into your webdb, marked with the current date and time. Then, you create a fetchlist using FetchListTool. The FetchListTool reads all URLs in the webdb that are due to be crawled and puts them into the fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool extracts all outlinks and puts them into the webdb. Newly extracted outlinks have their date and time set to the current date and time, while the just-crawled URLs have theirs pushed 30 days into the future (this actually happens in FetchListTool). So on the next pass, all newly extracted links will be crawled, but not the just-crawled URLs. And so on and so forth.
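A minimal command-line sketch of that sequence, assuming the 0.7-era bin/nutch sub-commands that wrap these classes (check bin/nutch's usage output for your version):

  # create the webdb (WebDBAdminTool)
  bin/nutch admin db -create
  # insert seed URLs into the webdb (WebDBInjector)
  bin/nutch inject db -urlfile seeds.txt
  # put URLs that are due for crawling into a fetchlist (FetchListTool)
  bin/nutch generate db segments
  s=$(ls -d segments/2* | tail -1)   # newest segment directory
  # crawl everything in the fetchlist (Fetcher)
  bin/nutch fetch $s
  # fold outlinks and updated fetch times back into the webdb (UpdateDatabaseTool)
  bin/nutch updatedb db $s

Repeating generate/fetch/updatedb is one round of the loop described above.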
Therefore, as long as the crawler is still alive after 30 days (or whatever threshold you set), all "just-crawled" URLs will be taken out to be recrawled. That's why we need to maintain a live crawler at that point. This could be done using a cron job, I think.
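For example, with a hypothetical recrawl.sh that wraps the generate/fetch/updatedb round (paths are placeholders):

  # crontab -e: run the recrawl nightly at 02:00
  0 2 * * * /opt/nutch/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1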
Regards,
Giang
On 12/20/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
Hi Nguyen,
Thank you for your information, but I would like to confirm it. I do see a variable that defines the next fetch interval, but I am not sure about it. If anyone has more information in this regard, please let me know.
Thank you in advance,
On 12/19/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
As I understand it, by default all links in Nutch are recrawled after 30 days, as long as your Nutch process is still running. FetchListTool takes care of this setting. So maybe you can write a script (and put it in cron?) to reactivate the crawler.
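The setting is presumably the db.default.fetch.interval property in conf/nutch-default.xml, which defaults to 30 (days). A sketch of overriding it in conf/nutch-site.xml:

  <property>
    <name>db.default.fetch.interval</name>
    <!-- recrawl after 7 days instead of the default 30 -->
    <value>7</value>
  </property>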
Regards,
Giang
On 12/19/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
Hi everyone,
I have browsed through the Nutch documentation, but I have not found enough information on how to recrawl the URLs that I have already crawled. Do we have to do the recrawling ourselves, or will the Nutch application do it? More information in this regard would be highly appreciated. Thank you very much.
--
Keep on smiling :) Kumar