Andrzej Bialecki wrote:
Dennis Kubes wrote:
So we moved 50 machines to a data center for a beta cluster of a new
search engine based on Nutch and Hadoop. We fired up all of the
machines and started fetching, and almost immediately began
experiencing JVM crashes and checksum/IO errors that would cause jobs
to fail, tasks to fail, and random data corruption. After digging
through and fixing the problems we have come up with some observations
that may seem obvious but may also help someone else avoid the same
problems.
[..]
Thanks Dennis for sharing this - it's very useful.
I could also add the following from my experience: for medium- to
large-scale crawling, i.e. on the order of 20-100 million pages, be
prepared to address the following issues:
* take a crash course in advanced DNS setup ;) I found that the
bottleneck often lies in DNS and not just in raw bandwidth limits. If
your fetchlist consists of many unique hosts, then Nutch will fire
thousands of DNS requests per second. Using an ordinary setup, i.e.
without caching, is pointless (most of the time the lookups will time
out) and harmful to the target DNS servers. You have to use a caching
DNS - I have had good experience with djbdns / tinydns, but they also
require careful tuning of the maximum number of requests, cache size,
ignoring too-short TTLs, etc.
I completely agree, although we use bind. DNS issues were one of the
first things that came up when we started using Nutch and Hadoop over
a year ago. I remember that you pointed us toward caching DNS servers
on the local machines at that time, and that has made all of the
difference. Originally we were using a single DNS server in the
domain, and by running large fetches (many fetchers at the same time)
we were effectively mounting a DoS attack on our own server. The
memory on the server couldn't handle it, so the entire fetch was
slowing down and erroring out.
I will add one point here: while we run caching servers on each
machine, we also use large DNS caches, such as OpenDNS and Verizon,
as our upstream lookup nameservers. The idea is that if we don't have
a record cached locally, one of the large caches probably will, and
it is better to check them before going directly to the global
nameservers. A large cache takes one hop while the global nameservers
take two. Here is what our resolv.conf looks like. The 208.x servers
are OpenDNS and the 4.x servers are Verizon. Note that both of these
caches are open to requests from anywhere, so anybody should be able
to use them.
nameserver 127.0.0.1
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 4.2.2.1
nameserver 4.2.2.2
nameserver 4.2.2.3
nameserver 4.2.2.4
nameserver 4.2.2.5
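
For anyone setting this up from scratch, the local caching layer plus
those forwarders can be expressed in a few lines of BIND 9
configuration. This is only a rough sketch, not our actual config -
the file paths, listen address and cache size are assumptions:

# named.conf options for a caching-only resolver on each node (example)
options {
    directory "/var/named";
    listen-on { 127.0.0.1; };   # only serve the local node
    recursion yes;              # answer recursive queries locally
    forward first;              # ask the big caches first, recurse ourselves on failure
    forwarders {
        208.67.222.222;         # OpenDNS
        208.67.220.220;         # OpenDNS
        4.2.2.1;                # Verizon
        4.2.2.2;                # Verizon
    };
    max-cache-size 256M;        # example value - size it to your fetchlist
};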
* check your network infrastructure. I had a few cases of clusters
that were giving sub-standard performance, only to find that e.g.
cables were flaky. In most cases, though, it's the network equipment
such as switches and routers - check their CPU usage and the number
of dropped packets. With some entry-level switches and routers, even
though the interfaces nominally support gigabit speeds, the switching
fabric and/or CPU can't sustain high packet rates - so they peg at
100% CPU, and even if they don't show any lost packets, a 'ping -f'
shows they can't handle the load.
Cables, what can I say about cables. We bought Cat 6 cables that,
when you wiggle them (or they get moved around), decide to reset the
network card. I would never have believed that was possible. Changing
the cables to Cat 5e fixed the problem. Weird.
There is a company called TRENDnet that sells 24-port gigabit
switches for around US$300, so we recently switched to gigabit
switches, as all of our network cards are gigabit. There was actually
a problem with the EEPROM on Intel e1000 network cards that was
causing connections to just drop at gigabit speeds but not at 100Mb
speeds. There is a script to fix that, and since we applied it the
connection rate at gigabit is awesome. We are able to sustain over
50MB/s on direct file transfers, which I think is pretty much the
hard disk limit. While it may take some time to get going, I fully
recommend a gigabit infrastructure.
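
If you suspect the network path, the checks Andrzej describes are
easy to run by hand. A rough sketch - the interface name and target
host below are placeholders, not from our setup:

# look for interface-level errors and drops on each node
netstat -i                                  # RX-ERR / TX-ERR / DRP should stay at zero
ethtool -S eth0 | grep -i -e err -e drop    # driver counters; eth0 is a placeholder

# stress the path to another node (needs root; be careful on a shared network)
ping -f -c 10000 node42                     # flood ping; check the loss figure in the summary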
* check OS-level resource limits (ulimit -a on POSIX systems). In one
installation we were experiencing weird crashes and finally
discovered that datanodes and tasktrackers were hitting the OS-wide
limit on open file handles. In another installation the OS-wide
limits were OK, but the limits on that particular account were
insufficient.
/var/log/messages is your friend ;). I think many people don't realize
when getting into search engines that it is as much about hardware and
system knowledge as it is about software.
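
For anyone who wants to check for the same thing, something along
these lines should show how close a datanode or tasktracker is to the
limit, and how to raise it. The 'hadoop' user name and 65536 are just
example values, not numbers from our cluster:

# per-process limit for the current shell, and actual fds held by the datanode
ulimit -n
ls /proc/$(pgrep -f DataNode)/fd | wc -l    # assumes a single DataNode process

# raise the per-user limit in /etc/security/limits.conf (example values)
hadoop  soft  nofile  65536
hadoop  hard  nofile  65536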
Dennis Kubes