For anyone searching this thread in the future: one possible cause of this is that the Hadoop nodes are not time-synchronized with NTP or something similar.

For example, suppose one or more of the slave nodes is a few minutes ahead of the others and an inject job runs on one of those nodes (job placement is essentially random and up to the system, so this would not happen every time if only some nodes are out of sync). If a generate job then runs on any node that is behind the out-of-sync nodes (again random), some of the URLs may not get fetched, because their scheduled fetch time in the crawl db is later than the current time on the machine doing the generate task. Being out of sync also seems to affect other things, such as tasks stalling for a couple of minutes, but I don't have specific information on that.
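To make that failure mode concrete, the generate step only selects entries whose scheduled fetch time has already passed on the clock of the machine running the task. Below is a minimal sketch of that kind of filter; the class and field names (CrawlEntry, fetchTime, select) are hypothetical stand-ins for illustration, not the actual Nutch Generator code, and the 3-minute skew is just an example.

import java.util.ArrayList;
import java.util.List;

// Illustration only: a generate-style selection filter that drops entries
// whose scheduled fetch time is still in the future on the local clock.
class CrawlEntry {
    String url;
    long fetchTime; // next scheduled fetch time, millis since epoch
    CrawlEntry(String url, long fetchTime) { this.url = url; this.fetchTime = fetchTime; }
}

class GenerateFilterSketch {
    // Keep only entries that are due according to this machine's clock.
    static List<CrawlEntry> select(List<CrawlEntry> db, long curTime) {
        List<CrawlEntry> due = new ArrayList<>();
        for (CrawlEntry e : db) {
            if (e.fetchTime <= curTime) {
                due.add(e);
            }
            // else: silently skipped -- this is where the "missing" URLs go
        }
        return due;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<CrawlEntry> db = new ArrayList<>();
        db.add(new CrawlEntry("http://example.com/a", now - 1000));            // injected on an in-sync node
        db.add(new CrawlEntry("http://example.com/b", now + 3 * 60 * 1000));   // injected on a node 3 min ahead
        System.out.println(select(db, now).size() + " of " + db.size() + " entries selected");
    }
}

With a 3-minute skew, the second entry is not selected even though it was just injected; on a synchronized cluster both would be due.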
The fix is to set up the nodes to use a time server on your network, or a public time server, and in either case make sure the nodes stay synchronized by having ntpd run on startup.

Dennis

AJ Chen wrote:
> Any idea why nutch (0.9-dev) does not try to fetch every url generated?
> For example, if Generator generates 200,000 urls, maybe <100,000 urls
> will be fetched, succeeded or failed. This is a big difference, which is
> obvious by checking the number of urls in the log or run readseg -list.
> What causes a large number of urls to get thrown out by the Fetcher?
>
> Thanks,
