Dennis Kubes wrote:
So we moved 50 machines to a data center for a beta cluster of a new
search engine based on Nutch and Hadoop. We fired up all of the
machines and started fetching, and almost immediately began
experiencing JVM crashes and checksum/IO errors which caused jobs to
fail, tasks to fail, and random data corruption. After digging through
and fixing the problems, we came up with some observations that may
seem obvious but may also help someone else avoid the same problems.
[..]
Thanks, Dennis, for sharing this - it's very useful.
I could also add the following from my experience: for medium-to-large
scale crawling, i.e. on the order of 20-100 million pages, be prepared
to address the following issues:
* take a crash course in advanced DNS setup ;) I found that often the
bottleneck lies in DNS and not in the raw bandwidth limits. If your
fetchlist consists of many unique hosts, then Nutch will fire thousands
of DNS requests per second. Using just an ordinary setup, i.e. without
caching, is pointless (most of the time the lookups will time out) and
harmful to the target DNS servers. You have to use a caching DNS
resolver - I have had good experiences with djbdns / tinydns, but it
also requires careful tuning of the maximum number of requests, the
cache size, ignoring overly short TTLs, etc. (see the first sketch
after this list).
* check your network infrastructure. I had a few cases of clusters that
were giving sub-standard performance, only to find that e.g. cables were
flaky. In most cases, though, it's the network equipment such as
switches and routers - check their CPU usage and the number of dropped
packets. On some entry-level switches and routers, even though the
interfaces nominally support gigabit speeds, the switching fabric and/or
CPU can't sustain high packet rates - they peg at 100% CPU, and even if
they don't report any lost packets, a 'ping -f' shows they can't handle
the load (a crude packet-rate probe is sketched after this list).
* check OS-level resource limits (ulimit -a on POSIX systems). In one
installation we were experiencing weird crashes and finally discovered
that datanodes and tasktrackers were hitting the OS-wide limit on open
file handles. In another installation the OS-wide limits were fine, but
the limits on the particular account running Hadoop were insufficient (a
small JVM-side check is sketched after this list).
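
To make the DNS point above more concrete, here is a minimal sketch (in
Java, since that is what Nutch runs on) that times forward lookups for a
sample of hosts taken from a fetchlist - run it once against the stock
resolver and once against the caching resolver, and compare. This is not
part of Nutch; the class name, the hosts.txt input file and the 2-second
"slow" threshold are just assumptions for illustration.

// DnsProbe.java - hypothetical helper, not part of Nutch.
// Times one forward lookup per host; lookups go through the system
// resolver (/etc/resolv.conf), so a local caching resolver will be
// reflected in the numbers.
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsProbe {
    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "hosts.txt"; // one hostname per line
        long slow = 0, failed = 0, total = 0, totalMillis = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String host;
            while ((host = in.readLine()) != null) {
                host = host.trim();
                if (host.isEmpty()) continue;
                total++;
                long start = System.currentTimeMillis();
                try {
                    InetAddress.getByName(host);           // forward lookup
                } catch (UnknownHostException e) {
                    failed++;                              // NXDOMAIN or timeout
                }
                long elapsed = System.currentTimeMillis() - start;
                totalMillis += elapsed;
                if (elapsed > 2000) slow++;                // arbitrary 2s threshold
            }
        }
        System.out.printf("lookups=%d failed=%d slow(>2s)=%d avg=%dms%n",
            total, failed, slow, total == 0 ? 0 : totalMillis / total);
    }
}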
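
For the network point: 'ping -f' typically needs root and only exercises
ICMP, so here is a crude UDP packet-rate probe in the same spirit -
again purely illustrative, with an arbitrary port, packet size and
count. Run it with "recv" on one node and with "send <receiver-host>"
on another; if the switch or router cannot sustain the small-packet
rate, the receiver will count far fewer packets than were sent (keeping
in mind that some UDP loss can also come from the hosts themselves).

// PacketRateProbe.java - hypothetical helper, not a standard tool.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

public class PacketRateProbe {
    static final int PORT = 9999;       // arbitrary
    static final int COUNT = 100000;    // packets to send
    static final int SIZE = 64;         // small packets stress packet rate, not bandwidth

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("recv")) receive();
        else send(args.length > 1 ? args[1] : "localhost");
    }

    static void send(String host) throws Exception {
        byte[] buf = new byte[SIZE];
        InetAddress addr = InetAddress.getByName(host);
        try (DatagramSocket sock = new DatagramSocket()) {
            long start = System.nanoTime();
            for (int i = 0; i < COUNT; i++) {
                sock.send(new DatagramPacket(buf, buf.length, addr, PORT));
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("sent %d packets in %.2fs (%.0f pkts/s)%n",
                COUNT, secs, COUNT / secs);
        }
    }

    static void receive() throws Exception {
        byte[] buf = new byte[SIZE];
        int received = 0;
        try (DatagramSocket sock = new DatagramSocket(PORT)) {
            sock.setSoTimeout(5000);    // stop 5s after packets stop arriving
            try {
                while (true) {
                    sock.receive(new DatagramPacket(buf, buf.length));
                    received++;
                }
            } catch (SocketTimeoutException done) {
                // sender finished (or packets stopped getting through)
            }
        }
        System.out.printf("received %d packets - compare with the sender's count%n",
            received);
    }
}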
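
And for the resource limits: the sketch below reports a running JVM's
open-file-descriptor usage against its per-process limit, using the
HotSpot-specific com.sun.management.UnixOperatingSystemMXBean (so it
will not compile or run on every JVM). Raising the limits themselves is
still done outside Java - 'ulimit -n' / limits.conf for the account
running Hadoop, and fs.file-max for the OS-wide limit on Linux - this
only makes the problem visible before datanodes and tasktrackers start
failing in strange ways.

// FdCheck.java - hypothetical helper; relies on a HotSpot-specific API.
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max  = unix.getMaxFileDescriptorCount();
            System.out.printf("open file descriptors: %d of %d (%.1f%%)%n",
                open, max, 100.0 * open / max);
            if (open > max * 0.8) {
                System.out.println("WARNING: close to the per-process fd limit"
                    + " - raise 'ulimit -n' for this account");
            }
        } else {
            System.out.println("not a Unix HotSpot JVM - check ulimit -n manually");
        }
    }
}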
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com