Hmm, I don't think the crawler is being blocked for lack of politeness, because I am using the default Nutch configuration, which is 1 request per second.
Also, when I crawl the sample URL with the Selenium plugin disabled in nutch-site.xml, I can retrieve some links. So the problem seems to be in the Selenium plugin: Firefox pops up, but nothing is fetched.

Best,
Nagarjun Pola

On Thu, Feb 19, 2015 at 10:05 PM, Jiaxin Ye <[email protected]> wrote:

> Hi, my teammate is also suffering from this situation now and I encountered
> this situation last night. But I am able to crawl now almost without doing
> anything. The reason I may guess is that your crawler is blocked by the
> website because not being polite. At least I believe that's the reason why
> I got the same *Could not initialize class org.apache.http.impl.conn.* last
> night. I don't how to solve it, though..... Fortunate enough I think I am
> unbanned now, I guess? Hope it helps......
>
> On Thu, Feb 19, 2015 at 9:44 PM, Nagarjun Pola <[email protected]> wrote:
>
>> I get the following error when tried with selenium. Firefox pops up couple
>> of times but fetches nothing.
>>
>> Can anyone help me on this issue?
>>
>> -activeThreads=50, spinWaiting=49, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
>> * queue: http://gcmd.gsfc.nasa.gov
>>   maxThreads = 1
>>   inProgress = 1
>>   crawlDelay = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1424410799976
>>   now = 1424410803146
>>   0. http://gcmd.gsfc.nasa.gov/
>>
>> fetch of
>> http://gcmd.gsfc.nasa.gov/KeywordSearch/Home.do?Portal=amd&MetadataType=0
>> failed with: java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
>>
>> -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
>> * queue: http://gcmd.gsfc.nasa.gov
>>   maxThreads = 1
>>   inProgress = 0
>>   crawlDelay = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1424410808305
>>   now = 1424410804147
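
A note on the error quoted above: in Java, "NoClassDefFoundError: Could not initialize class X" generally means that X was found but its static initializer failed on an earlier attempt. One possible cause (an assumption, not something this thread confirms) is that the HttpClient and HttpCore jars on the classpath come from mismatched versions. The sketch below reports which jar a class is loaded from without running its static initializer. The class name WhichJar and the helper locationOf are made up for illustration, and org.apache.http.HttpVersion is only my choice of a probe class from HttpCore.

```java
import java.security.CodeSource;

public class WhichJar {

    /**
     * Returns the location (usually a jar URL) a class is loaded from.
     * Hypothetical helper for illustration; not part of Nutch.
     */
    static String locationOf(String className) {
        try {
            // initialize=false: load the class without running its static
            // initializer, so a broken initializer does not fail again here.
            Class<?> c = Class.forName(className, false,
                    WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK core classes typically report no code source.
            return src == null
                    ? "(bootstrap or unknown)"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            // Class named in the error message in this thread.
            "org.apache.http.impl.conn.ManagedHttpClientConnectionFactory",
            // Assumption: a probe class that lives in the HttpCore jar.
            "org.apache.http.HttpVersion"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

To be meaningful, this would need to run with the same classpath the Nutch fetcher uses. If the two lines point at jars with clearly different version lines, or at an unexpected copy bundled with a plugin, that would be worth checking before looking at the target website.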

