The script works well (Nutch 0.9). However, after watching the log output and reviewing the code, I have some concerns:

- The script re-indexes the whole database, which is slow (it takes as long as indexing from scratch). Is there any way to re-index only the pages that have changed?
- The generate step is also long. Can it be improved?
- db.default.fetch.interval applies to all pages. Is there any way to make it adaptive? I mean, some pages need to be re-fetched every day, such as the home page of a news site.
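On the fetch interval question, one partial workaround is to lower db.default.fetch.interval globally in conf/nutch-site.xml. This is only a sketch: it assumes Nutch 0.9, where the value is expressed in days, and note that the property is a blanket setting for every page, not a per-page schedule:

```xml
<!-- conf/nutch-site.xml: overrides the default from conf/nutch-default.xml.
     Assumes Nutch 0.9, where the interval is given in days. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>Consider pages due for re-fetching after this many days
  (the shipped default is 30). Applies uniformly to every page in the
  crawldb.</description>
</property>
```

Since this applies to every URL, a value of 1 would make the whole crawl eligible for daily re-fetching; per-page adaptivity would need custom scheduling code rather than this property.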
Thanks,
Nghia Nguyen

Susam Pal wrote:
> You can try the crawl script: http://wiki.apache.org/nutch/Crawl
>
> Regards,
> Susam Pal
>
> On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> When I run crawl the second time, it always complains that 'crawled'
>> already exists. I always need to remove this directory using
>> 'hadoop dfs -rm crawled' to get going.
>> Is there some way to avoid this error and tell Nutch that it's a recrawl?
>>
>> bin/nutch crawl urls -dir crawled -depth 1 2>&1 | tee /tmp/foo.log
>>
>> Exception in thread "main" java.lang.RuntimeException: crawled already exists.
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
>>
>> Thanks,
>>
>> Manoj.

--
View this message in context: http://www.nabble.com/%27crawled-already-exists%27---how-do-I-recrawl--tp14781783p14841677.html
Sent from the Nutch - User mailing list archive at Nabble.com.
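On the original "crawled already exists" question: the one-shot bin/nutch crawl command refuses to reuse an existing directory, but the individual steps it wraps can be run against the existing crawldb, which is what the wiki Crawl script does. A minimal sketch follows; the directory layout and the Nutch 0.9 sub-command usages are assumptions, so check the usage printed by bin/nutch for your version before relying on it:

```shell
#!/bin/sh
# Sketch of a recrawl using the step-wise commands that the one-shot
# "bin/nutch crawl" wraps. Assumes an existing crawl in ./crawled with
# a Nutch 0.9-style layout (crawldb/, segments/, linkdb/, indexes/).
DIR=crawled

# Generate a fetchlist containing only the pages that are due for
# re-fetching according to the crawldb.
bin/nutch generate $DIR/crawldb $DIR/segments

# The newest segment directory is the one generate just created.
SEGMENT=$DIR/segments/$(ls $DIR/segments | sort | tail -1)

# Fetch (and parse) the due pages only.
bin/nutch fetch $SEGMENT

# Fold the fetch results back into the crawl database.
bin/nutch updatedb $DIR/crawldb $SEGMENT

# Rebuild the link database for the new segment and index it.
bin/nutch invertlinks $DIR/linkdb $SEGMENT
bin/nutch index $DIR/indexes $DIR/crawldb $DIR/linkdb $SEGMENT
```

Because it reuses the existing directory instead of creating it, this avoids the RuntimeException, and only pages whose fetch interval has expired are re-fetched. The wiki script additionally dedups and merges the indexes afterwards, which is omitted here for brevity.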