The script works well (Nutch 0.9). However, after watching the log output and reviewing the code, I have some concerns:

- The script re-indexes the whole database, which is slow (it takes as long as indexing from scratch). Is there any way to re-index only the pages that have changed?
- The generate step is also long. Can it be improved?
- db.default.fetch.interval applies to all pages. Is there any way to make it adaptive? I mean, some pages need to be re-fetched every day, such as the home page of a news site.
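On the fetch interval question, one partial workaround is to lower db.default.fetch.interval globally in conf/nutch-site.xml. This is only a sketch: it assumes Nutch 0.9, where the value is expressed in days, and note that the property is a blanket setting for every page, not a per-page schedule:

```xml
<!-- conf/nutch-site.xml: overrides the default from conf/nutch-default.xml.
     Assumes Nutch 0.9, where the interval is given in days. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>Consider pages due for re-fetching after this many days
  (the shipped default is 30). Applies uniformly to every page in the
  crawldb.</description>
</property>
```

Since this applies to every URL, a value of 1 would make the whole crawl eligible for daily re-fetching; per-page adaptivity would need custom scheduling code rather than this property.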
Thanks,
Nghia Nguyen

Susam Pal wrote:
> You can try the crawl script: http://wiki.apache.org/nutch/Crawl
>
> Regards,
> Susam Pal
>
> On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> When I run crawl the second time, it always complains that 'crawled'
>> already exists. I always need to remove this directory using
>> 'hadoop dfs -rm crawled' to get going.
>> Is there some way to avoid this error and tell Nutch that it's a recrawl?
>>
>> bin/nutch crawl urls -dir crawled -depth 1 2>&1 | tee /tmp/foo.log
>>
>> Exception in thread "main" java.lang.RuntimeException: crawled already exists.
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
>>
>> Thanks,
>>
>> Manoj.

--
View this message in context: http://www.nabble.com/%27crawled-already-exists%27---how-do-I-recrawl--tp14781783p14841677.html
Sent from the Nutch - User mailing list archive at Nabble.com.
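On the original "crawled already exists" question: the one-shot bin/nutch crawl command refuses to reuse an existing directory, but the individual steps it wraps can be run against the existing crawldb, which is what the wiki Crawl script does. A minimal sketch follows; the directory layout and the Nutch 0.9 sub-command usages are assumptions, so check the usage printed by bin/nutch for your version before relying on it:

```shell
#!/bin/sh
# Sketch of a recrawl using the step-wise commands that the one-shot
# "bin/nutch crawl" wraps. Assumes an existing crawl in ./crawled with
# a Nutch 0.9-style layout (crawldb/, segments/, linkdb/, indexes/).
DIR=crawled

# Generate a fetchlist containing only the pages that are due for
# re-fetching according to the crawldb.
bin/nutch generate $DIR/crawldb $DIR/segments

# The newest segment directory is the one generate just created.
SEGMENT=$DIR/segments/$(ls $DIR/segments | sort | tail -1)

# Fetch (and parse) the due pages only.
bin/nutch fetch $SEGMENT

# Fold the fetch results back into the crawl database.
bin/nutch updatedb $DIR/crawldb $SEGMENT

# Rebuild the link database for the new segment and index it.
bin/nutch invertlinks $DIR/linkdb $SEGMENT
bin/nutch index $DIR/indexes $DIR/crawldb $DIR/linkdb $SEGMENT
```

Because it reuses the existing directory instead of creating it, this avoids the RuntimeException, and only pages whose fetch interval has expired are re-fetched. The wiki script additionally dedups and merges the indexes afterwards, which is omitted here for brevity.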