[Nutch-general] RE: Introduction to Nutch, Part 1: Crawling

Andy Morris Thu, 12 Jan 2006 09:36:01 -0800

Okay, so do you run the command bin/nutch generate -dir somedirectory, or
what?
Do you have to be in the original crawl directory?
Andy


-----Original Message-----
From: Thomas Sondergaard [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 12, 2006 8:11 AM
To: [email protected]
Subject: Re: Introduction to Nutch, Part 1: Crawling

Is it safe to run these commands while the searcher (web interface) is
using it? In other words, can I just do the following:

1) crawl
2) start tomcat
3) set up a cron job that runs the following commands every 5 days (for
my intranet I don't want to be up to 30 days behind): 1. generate,
2. updatedb, 3. invertlinks, 4. index, 5. dedup, 6. merge
4) Sit back and enjoy my eternally up-to-date intranet search engine?

Thanks,

Thomas


Gal Nitzan wrote:

>The crawl tool can be used only once.
>
>After running the initial crawl you cannot use this tool again.
>
>From that point on you would run:
>
>1. generate
>2. updatedb
>3. invertlinks
>4. index
>5. dedup
>6. merge
>
>The default re-fetch interval for pages is 30 days.
>
>So basically, if you finished crawling your intranet in the initial
>crawl, you would next run generate 30 days later.
>
>However, you can run generate with the -adddays parameter set to 30
>and it will generate a fetchlist with all pages already in your crawldb.
>
>If your system contains new pages, the crawler would find them during
>the fetch and would update the crawldb.
>
>G.
>
>On Thu, 2006-01-12 at 07:44 -0500, Andy Morris wrote:
>  
>
>>After doing an initial crawl, how do you keep that directory current?
>>How often should an intranet crawl be run?  Should this be a cron job,
>>and do I have to restart tomcat after each crawl?
>>
>>Andy
>>-----Original Message-----
>>From: Tom White [mailto:[EMAIL PROTECTED]
>>Sent: Wednesday, January 11, 2006 4:21 AM
>>To: [email protected]
>>Subject: Introduction to Nutch, Part 1: Crawling
>>
>>Hi,
>>
>>I've written an article about using Nutch at the intranet scale, which
>>you may find interesting:
>>http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
>>Please post any comments on the article page itself.
>>
>>I've updated the wiki to link to it too.
>>
>>Regards,
>>
>>Tom
>>
>>    
>>
>
>
>  
>
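
The scheduling rule Gal describes above (a 30-day default interval, with
-adddays pulling pages in early) can be illustrated with a small sketch.
This is not Nutch code. It assumes the generator selects a page when its
next fetch time is no later than the current time shifted forward by the
-adddays value. The names is_due and build_fetchlist, the dict standing in
for the crawldb, and the example URLs are all hypothetical.

```python
from datetime import datetime, timedelta

# Default re-fetch interval mentioned in the thread.
DEFAULT_FETCH_INTERVAL_DAYS = 30


def is_due(last_fetched, now, add_days=0,
           interval_days=DEFAULT_FETCH_INTERVAL_DAYS):
    """Return True if a page last fetched at `last_fetched` would be selected.

    Assumption: the page's next fetch time is compared against `now`
    shifted forward by `add_days` days.
    """
    next_fetch = last_fetched + timedelta(days=interval_days)
    effective_now = now + timedelta(days=add_days)
    return next_fetch <= effective_now


def build_fetchlist(pages, now, add_days=0):
    """Pick the due URLs from a {url: last_fetched} dict (toy crawldb)."""
    return sorted(url for url, fetched in pages.items()
                  if is_due(fetched, now, add_days))


if __name__ == "__main__":
    now = datetime(2006, 1, 12)
    pages = {
        "http://intranet.example/a": now - timedelta(days=5),
        "http://intranet.example/b": now - timedelta(days=31),
    }
    # Without add_days only the page older than 30 days is due.
    print(build_fetchlist(pages, now))
    # With add_days=30 every page already in the toy crawldb is due.
    print(build_fetchlist(pages, now, add_days=30))
```

Under that assumption, a cron job that runs every 5 days would need a
non-zero -adddays value; otherwise a page fetched 5 days ago would not be
selected again until its 30 days are up.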


