Hi, I have a very specific list of URLs to crawl, and I implemented it by turning off this property:

    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
      <description>If true, updatedb will add newly discovered URLs, if false
      only already existing URLs in the CrawlDb will be updated and no new
      URLs will be added.
      </description>
    </property>
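For context, I put this in conf/nutch-site.xml. Here is a rough sketch of how the CrawlDb can be checked to confirm it only holds the injected URLs (the result/crawldb path and the crawldb_dump output directory are assumptions based on the -dir result layout of my crawl command further down):

    # print CrawlDb statistics (total URL count, status breakdown)
    bin/nutch readdb result/crawldb -stats

    # dump the CrawlDb entries as text for spot-checking
    bin/nutch readdb result/crawldb -dump crawldb_dump

With db.update.additions.allowed set to false, the -stats total should stay equal to the number of injected URLs across crawl rounds.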
So it will not add the newly parsed URLs / outbound links to the crawldb. I fed a single URL to Nutch first and it works exactly the way I want: I can read the raw HTML by deserializing the segment/content/part-000/data file.

However, when I feed 520 URLs to Nutch, the result confuses me. Nutch created 3 separate segment folders, each with the same structure as the one I just mentioned. When I check the data files in each folder (a sketch of the commands I use for this inspection is at the end of this post):

folder 1 contains: 400 URLs and their HTML
folder 2 contains: 487 URLs and their HTML
folder 3 contains: 520 URLs and their HTML

That adds up to about 1400 records. There are many duplicates across the folders, and about 900 distinct URLs in total, which is even more than the number of URLs I fed to Nutch.

Here is the research I have done so far: I have read the source code of the Injector and am working through the Fetcher. The Fetcher mentions that "the number of queues is based on the number of hosts", and I am wondering whether that has anything to do with how those three folders came about.

Can anyone help me understand how those three folders come into existence and why the URL counts are so strange? Any hint is appreciated, or point me to the right class so I can do some homework myself.

------------------
Extra info: I am using AWS EC2 Ubuntu and Nutch 1.7.

The command to run the crawl:

nohup bin/nutch crawl urls -dir result -depth 3 -topN 10000 &
-------------------
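For completeness, here is a sketch of how I inspect the segments mentioned above (the segment name 20140101000000 is just a placeholder; the real folders under result/segments/ have timestamp names, and seg_dump is an arbitrary output directory):

    # list each segment with its generated/fetched/parsed URL counts
    bin/nutch readseg -list -dir result/segments

    # dump only the raw content of one segment to text for inspection
    bin/nutch readseg -dump result/segments/20140101000000 seg_dump \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext

The second command deserializes the content data of a single segment into plain text, which is roughly how I arrived at the per-folder URL counts above.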

