Hmm, I don't think the crawler is being blocked for lack of politeness, because I am
using the default Nutch configuration, which is 1 request per second.




And when I crawl the sample URL with the selenium plugin disabled in
nutch-site.xml, I can retrieve some links.




The problem seems to be in the selenium plugin. Although Firefox pops up, nothing
is fetched.
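
One way to narrow this down, as a minimal sketch: as far as I know, the JVM reports "Could not initialize class" only after an earlier attempt to initialize that class has already failed, so the first failure (usually an ExceptionInInitializerError with a cause) is the interesting one. The small probe below tries to load a class by name and prints either the jar it came from or the root cause of the failure. The class name ClassInitProbe and the method probe are made up for illustration, and it only tells you something if it is run with the same jars on the classpath that the selenium plugin uses, which depends on your setup.

```java
import java.security.CodeSource;

public class ClassInitProbe {

    /** Tries to load and initialize a class; reports its origin or why it failed. */
    static String probe(String className) {
        try {
            // Class.forName(String) also runs the class's static initializer.
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            String origin = (src == null || src.getLocation() == null)
                    ? "bootstrap/JDK"
                    : src.getLocation().toString();
            return "OK: " + className + " loaded from " + origin;
        } catch (ClassNotFoundException e) {
            return "MISSING: " + className + " is not on the classpath";
        } catch (LinkageError e) {
            // Covers ExceptionInInitializerError and NoClassDefFoundError.
            Throwable root = e;
            while (root.getCause() != null) {
                root = root.getCause();
            }
            return "FAILED: " + className + " -> " + root;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the fetch error; override via argument.
        String name = args.length > 0
                ? args[0]
                : "org.apache.http.impl.conn.ManagedHttpClientConnectionFactory";
        System.out.println(probe(name));
    }
}
```

If the probe prints a FAILED line, the root cause it shows may point to a missing or mismatched dependency jar, which would be worth ruling out before looking at the website side.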



Best,
Nagarjun Pola

On Thu, Feb 19, 2015 at 10:05 PM, Jiaxin Ye <[email protected]> wrote:

> Hi, my teammate is running into this now, and I encountered it last night.
> But I am able to crawl again now, almost without having changed anything.
> My guess is that your crawler is being blocked by the website for not being
> polite. At least I believe that's why I got the same *Could not initialize
> class org.apache.http.impl.conn.* error last night. I don't know how to
> solve it, though. Fortunately, I think I am unbanned now. Hope it helps.
> On Thu, Feb 19, 2015 at 9:44 PM, Nagarjun Pola <[email protected]> wrote:
>> I get the following error when I try with selenium. Firefox pops up a couple
>> of times but fetches nothing.
>>
>> Can anyone help me on this issue?
>>
>> -activeThreads=50, spinWaiting=49, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
>> * queue: http://gcmd.gsfc.nasa.gov
>>   maxThreads    = 1
>>   inProgress    = 1
>>   crawlDelay    = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1424410799976
>>   now           = 1424410803146
>>   0. http://gcmd.gsfc.nasa.gov/
>> fetch of http://gcmd.gsfc.nasa.gov/KeywordSearch/Home.do?Portal=amd&MetadataType=0 failed with: java.lang.NoClassDefFoundError: Could not initialize class org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
>> -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
>> * queue: http://gcmd.gsfc.nasa.gov
>>   maxThreads    = 1
>>   inProgress    = 0
>>   crawlDelay    = 5000
>>   minCrawlDelay = 0
>>   nextFetchTime = 1424410808305
>>   now           = 1424410804147
>>
