Hi Bin,

You have an interesting error. I don't use 1.7, but you can try it with the screen command instead of nohup. I believe you will not get the same error.
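A minimal sketch of what that could look like (the session name nutch-crawl is just a placeholder):

    screen -S nutch-crawl        # start a named screen session
    bin/nutch crawl urls -dir result -depth 1 -topN 200000
    # detach with Ctrl-a d; the crawl keeps running in the session
    screen -r nutch-crawl        # reattach later to check progress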
Talat

2013/12/27 Bin Wang <[email protected]>

> Hi,
>
> I have a very specific list of URLs, about 140K in total.
>
> I switched off `db.update.additions.allowed` so it will not update the
> crawldb... and I was assuming I could feed all the URLs to Nutch, and
> after one round of fetching it would finish and leave all the raw HTML
> files in the segment folder.
>
> However, after I ran this command:
> nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
>
> it ended up with a small number of URLs:
> TOTAL urls: 872
> retry 0:    872
> min score:  1.0
> avg score:  1.0
> max score:  1.0
>
> And I double-checked the log to make sure that every URL passes the
> filters and normalization. Here is the log:
>
> 2013-12-27 17:55:25,068 INFO crawl.Injector - Injector: total number of
> urls rejected by filters: 0
> 2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: total number of
> urls injected after normalization and filtering: 139058
> 2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: Merging injected
> urls into crawl db.
>
> I don't know how 140K URLs ended up being 872 in the end...
>
> /usr/bin
>
> ----------------------
> AWS ubuntu instance
> Nutch 1.7
> java version "1.6.0_27"
> OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4)
> OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
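For reference, the property Bin mentions is overridden in conf/nutch-site.xml. A minimal sketch of switching it off, assuming it was disabled by setting the value to false:

    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
      <description>If false, updatedb only updates URLs already in the
      crawldb and does not add newly discovered outlinks.</description>
    </property>

With this set to false, only the injected URLs are eligible for fetching; outlinks discovered during the round are not added to the crawldb.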

