Hi Bin Wang,

>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &

Were you creating a new crawldb or reusing an old one?
Were you running this on a cluster or in local mode? Was there any failure that caused the fetch round to abort? (See the logs for this.)

I would like to reproduce this issue. Would it be possible for you to share your config files and a subset of the URLs?

Thanks,
Tejas


On Sat, Dec 28, 2013 at 2:10 AM, Talat Uyarer <[email protected]> wrote:

> Hi Bin,
>
> You have an interesting error. I don't use 1.7, but I can try with the screen
> command. I believe you will not get the same error.
>
> Talat
>
>
> 2013/12/27 Bin Wang <[email protected]>
>
>> Hi,
>>
>> I have a very specific list of URLs, about 140K of them.
>>
>> I switched off `db.update.additions.allowed` so that it will not update the
>> crawldb, and I was assuming I could feed all the URLs to Nutch and that, after
>> one round of fetching, it would finish and leave all the raw HTML files in
>> the segment folder.
>>
>> However, after I ran this command:
>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
>>
>> it ended up with a small number of URLs:
>> TOTAL urls: 872
>> retry 0: 872
>> min score: 1.0
>> avg score: 1.0
>> max score: 1.0
>>
>> I double-checked the log to make sure that every URL passes the
>> filters and normalization. Here is the log:
>>
>> 2013-12-27 17:55:25,068 INFO crawl.Injector - Injector: total number of
>> urls rejected by filters: 0
>> 2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: total number of
>> urls injected after normalization and filtering: 139058
>> 2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: Merging injected
>> urls into crawl db.
>>
>> I don't know how 140K URLs ended up being 872 in the end...
>>
>> /usr/bin
>>
>> ----------------------
>> AWS ubuntu instance
>> Nutch 1.7
>> java version "1.6.0_27"
>> OpenJDK Runtime Environment (IcedTea6 1.12.6)
>> (6b27-1.12.6-1ubuntu0.12.04.4)
>> OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>>
>
>
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>

