You ran 3 rounds of the Nutch crawl ("-depth 3"), and those 3 folders are the
3 segments, one created for each round of the crawl.
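
If you want to confirm that, you can list the segments and check the crawldb
counts. Since your crawl command used "-dir result", the data should be under
result/segments and result/crawldb:

  bin/nutch readseg -list -dir result/segments
  bin/nutch readdb result/crawldb -stats

The first command prints a one-line synopsis per segment (how many URLs were
generated and fetched in that round), and the second prints the total and
per-status URL counts in the crawldb, which should make the per-folder numbers
easier to interpret.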
About the 520 URLs, I don't see any obvious reason for that happening. You
should look at a few of the new URLs that were added, find out what their
parent URLs were, and then run a small crawl using those parents as seeds.
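
One way to track down the parents is to dump a segment and grep for outlinks.
The segment name below is a placeholder, and the exact layout of the dump can
differ a bit between versions, but something along these lines should work:

  bin/nutch readseg -dump result/segments/<segment_name> seg_dump
  grep -r "toUrl:" seg_dump | sort | uniq -c | sort -rn | head

Each record in the dump starts with the URL of the fetched page, so when one
of the unexpected URLs shows up as an outlink inside a record, the URL at the
top of that record is its parent.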

Thanks,
Tejas

On Tue, Dec 24, 2013 at 8:06 AM, Bin Wang <[email protected]> wrote:

> Hi,
>
> I have a very specific list of URLs to crawl and I implemented it by
> turning off this property:
>
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>false</value>
>   <description>If true, updatedb will add newly discovered URLs, if false
>   only already existing URLs in the CrawlDb will be updated and no new
>   URLs will be added.
>   </description>
> </property>
>
> So it will not add the parsed new URLs / outbound links into the crawldb.
>
> I tried to feed one link to Nutch and it works exactly the way I want, and
> I can read the raw HTML by deserializing the segment/content/part-000/data
> file.
>
> However, when I feed 520 URLs to Nutch, the result is confusing me.
> It created 3 separate folders, and each one has the same structure as the
> folder I just mentioned. When I check the data files in each folder:
> folder 1 contains: 400 URLs and their HTML
> folder 2 contains: 487 URLs ...
> folder 3 contains: 520 URLs ...
>
> And they add up to about 1400! There are many duplicates when you add them
> up, and there are 900 distinct URLs in total, which is even more than the
> number of URLs that I fed Nutch.
>
> Here is the research that I have done:
> I have read the source code for Injector and am working on Fetcher.
> The Fetcher mentions that "the number of Queues is based on the number of
> hosts..." and I am wondering whether that has anything to do with how those
> three folders come about.
>
> Can anyone help me understand how those three folders came into existence
> and why the URL numbers are so weird?
>
> Any hint is appreciated, or point me to the right class so I can do some
> homework myself.
>
> ------------------
> Extra Info:
> I am using AWS EC2 Ubuntu and Nutch 1.7.
> The command to run the crawl:
> nohup bin/nutch crawl urls -dir result -depth 3 -topN 10000 &
> -------------------
>
> /usr/bin

