Hi, I have a specific list of about 140K URLs.
I switched off `db.update.additions.allowed` so the crawl would not update the crawldb, and I assumed I could feed all the URLs to Nutch, run one round of fetching, and end up with all the raw HTML files in the segment folder. However, after running this command:

```
nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
```

it ended up with only a small number of URLs:

```
TOTAL urls: 872
retry 0:    872
min score:  1.0
avg score:  1.0
max score:  1.0
```

I double-checked the log to make sure that every URL passes the filters and normalization. Here is the relevant part of the log:

```
2013-12-27 17:55:25,068 INFO crawl.Injector - Injector: total number of urls rejected by filters: 0
2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
```

I don't understand how 140K URLs ended up as only 872 in the end.

----------------------
AWS Ubuntu instance
Nutch 1.7
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
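In case it matters, here is a minimal sketch of how I believe the override ends up in `conf/nutch-site.xml`; I'm paraphrasing the standard property format rather than pasting my exact file:

```xml
<!-- conf/nutch-site.xml (sketch): stop updatedb from adding newly
     discovered outlinks, so the crawldb stays limited to the seed list -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, updatedb will not add newly discovered URLs
  to the crawldb; only the injected URLs remain.</description>
</property>
```

For what it's worth, the "TOTAL urls / retry 0 / min/avg/max score" figures above are in the format that `bin/nutch readdb result/crawldb -stats` prints, which is where the 872 number comes from.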

