Do the steps manually as described here:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial
On 21.12.2005, at 13:01, Arun Kaundal wrote:
Hi Giang
But if I want to run the CrawlTool manually, say every hour, it throws an error like "Crawl directory already exists". If I comment out that check, I get a number of errors like "Directory already exists". What should I do? Please show me a way out...
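(A sketch of one workaround, assuming the 0.7-era crawl command's -dir and -depth flags: give each run a fresh, timestamped directory so CrawlTool never sees an existing one.)

  # hypothetical hourly wrapper: fresh crawl directory per run
  bin/nutch crawl urls.txt -dir crawl-$(date +%Y%m%d%H%M) -depth 3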
On 12/20/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
The scheme of intranet crawling is like this: first, you create a webdb using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector. The seed URL is inserted into your webdb, marked with the current date and time. Then, you create a fetchlist using FetchListTool. The FetchListTool reads all URLs in the webdb that are due to be crawled and puts them into the fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool extracts all outlinks and puts them into the webdb. Newly extracted outlinks have their date and time set to the current date and time, while the just-crawled URLs have theirs pushed 30 days into the future (this actually happens in FetchListTool). So on the next pass, all newly extracted links will be crawled, but not the just-crawled URLs. And so on and so forth.
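A minimal command-line sketch of that sequence, assuming the 0.7-era bin/nutch sub-commands that wrap these classes (check bin/nutch's usage output for your version):

  # create the webdb (WebDBAdminTool)
  bin/nutch admin db -create
  # insert seed URLs into the webdb (WebDBInjector)
  bin/nutch inject db -urlfile seeds.txt
  # put URLs that are due for crawling into a fetchlist (FetchListTool)
  bin/nutch generate db segments
  s=$(ls -d segments/2* | tail -1)   # newest segment directory
  # crawl everything in the fetchlist (Fetcher)
  bin/nutch fetch $s
  # fold outlinks and updated fetch times back into the webdb (UpdateDatabaseTool)
  bin/nutch updatedb db $s

Repeating generate/fetch/updatedb is one round of the loop described above.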
Therefore, as long as the crawler is still alive after 30 days (or whatever threshold you set), all "just-crawled" URLs will be taken out to be recrawled. That's why we need to maintain a live crawler at that point. This could be done using a cron job, I think.
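For example, with a hypothetical recrawl.sh that wraps the generate/fetch/updatedb round (paths are placeholders):

  # crontab -e: run the recrawl nightly at 02:00
  0 2 * * * /opt/nutch/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1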
Regards,
Giang
On 12/20/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
Hi Nguyen,
Thank you for your information, but I would like to confirm it. I do see a variable that defines the next fetch interval, but I am not sure about it. If anyone has more information in this regard, please let me know.
Thank you in advance,
On 12/19/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
As I understand it, by default all links in Nutch are recrawled after 30 days, as long as your Nutch process is still running. FetchListTool takes care of this setting. So maybe you can write a script (and put it in cron?) to reactivate the crawler.
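The setting is presumably the db.default.fetch.interval property in conf/nutch-default.xml, which defaults to 30 (days). A sketch of overriding it in conf/nutch-site.xml:

  <property>
    <name>db.default.fetch.interval</name>
    <!-- recrawl after 7 days instead of the default 30 -->
    <value>7</value>
  </property>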
Regards,
Giang
On 12/19/05, Kumar Limbu <[EMAIL PROTECTED]> wrote:
Hi everyone,
I have browsed through the Nutch documentation, but I have not found enough information on how to recrawl the URLs that I have already crawled. Do we have to do the recrawling ourselves, or will the Nutch application do it? More information in this regard would be highly appreciated. Thank you very much.
--
Keep on smiling :) Kumar