Andrzej Bialecki wrote:
Dennis Kubes wrote:
So we moved 50 machines to a data center for a beta cluster of a new
search engine based on Nutch and Hadoop. We fired up all of the
machines and started fetching, and almost immediately began
experiencing JVM crashes and checksum/IO errors that would cause jobs
to fail, tasks to fail, and random data corruption. After digging
through and fixing the problems we have come up with some observations
that may seem obvious but may also help someone else avoid the same
problems.
[..]
Thanks Dennis for sharing this - it's very useful.
I could also add the following from my experience: for medium- to
large-scale crawling, i.e. on the order of 20-100 million pages, be
prepared to address the following issues:
* take a crash course in advanced DNS setup ;) I found that the
bottleneck often lies in DNS and not just in raw bandwidth limits. If
your fetchlist consists of many unique hosts, then Nutch will fire
thousands of DNS requests per second. Using an ordinary setup, i.e.
without caching, is pointless (most of the time the lookups will time
out) and harmful to the target DNS servers. You have to use a caching
DNS - I have had good experience with djbdns / tinydns, but they also
require careful tuning of the maximum number of requests, cache size,
ignoring too-short TTLs, etc.
I completely agree, although we use bind. DNS issues were one of the
first things that came up when we started using Nutch and Hadoop over
a year ago. I remember that you pointed us toward caching DNS servers
on the local machines at that time, and that has made all of the
difference. Originally we were using a single DNS server in the
domain, and by running large fetches (many fetchers at the same time)
we were effectively mounting a DoS attack on our own server. The
memory on the server couldn't handle it, so the entire fetch was
slowing down and erroring out.
I will add one point here: while we run caching servers on each
machine, we also use large DNS caches, such as OpenDNS and Verizon,
as our upstream lookup nameservers. The idea is that if we don't have
a record cached locally, one of the large caches probably will, and
it is better to check them before going directly to the global
nameservers. A large cache takes one hop while the global nameservers
take two. Here is what our resolv.conf looks like. The 208.x servers
are OpenDNS and the 4.x servers are Verizon. Note that both of these
caches are open to requests from anywhere, so anybody should be able
to use them.
nameserver 127.0.0.1
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 4.2.2.1
nameserver 4.2.2.2
nameserver 4.2.2.3
nameserver 4.2.2.4
nameserver 4.2.2.5
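
For anyone setting this up from scratch, the local caching layer plus
those forwarders can be expressed in a few lines of BIND 9
configuration. This is only a rough sketch, not our actual config -
the file paths, listen address and cache size are assumptions:

# named.conf options for a caching-only resolver on each node (example)
options {
    directory "/var/named";
    listen-on { 127.0.0.1; };   # only serve the local node
    recursion yes;              # answer recursive queries locally
    forward first;              # ask the big caches first, recurse ourselves on failure
    forwarders {
        208.67.222.222;         # OpenDNS
        208.67.220.220;         # OpenDNS
        4.2.2.1;                # Verizon
        4.2.2.2;                # Verizon
    };
    max-cache-size 256M;        # example value - size it to your fetchlist
};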
* check your network infrastructure. I had a few cases of clusters
that were giving sub-standard performance, only to find that e.g.
cables were flaky. In most cases, though, it's the network equipment
such as switches and routers - check their CPU usage and the number
of dropped packets. With some entry-level switches and routers, even
though the interfaces nominally support gigabit speeds, the switching
fabric and/or CPU can't sustain high packet rates - so they peg at
100% CPU, and even if they don't show any lost packets, a 'ping -f'
shows they can't handle the load.
Cables, what can I say about cables. We bought Cat 6 cables that,
when you wiggle them (or they get moved around), decide to reset the
network card. I would never have believed that was possible. Changing
the cables to Cat 5e fixed the problem. Weird.
There is a company called TRENDnet that sells 24-port gigabit
switches for around US$300, so we recently switched to gigabit
switches, as all of our network cards are gigabit. There was actually
a problem with the EEPROM on Intel e1000 network cards that was
causing connections to just drop at gigabit speeds but not at 100Mb
speeds. There is a script to fix that, and since we applied it the
connection rate at gigabit is awesome. We are able to sustain over
50MB/s on direct file transfers, which I think is pretty much the
hard disk limit. While it may take some time to get going, I fully
recommend a gigabit infrastructure.
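
If you suspect the network path, the checks Andrzej describes are
easy to run by hand. A rough sketch - the interface name and target
host below are placeholders, not from our setup:

# look for interface-level errors and drops on each node
netstat -i                                  # RX-ERR / TX-ERR / DRP should stay at zero
ethtool -S eth0 | grep -i -e err -e drop    # driver counters; eth0 is a placeholder

# stress the path to another node (needs root; be careful on a shared network)
ping -f -c 10000 node42                     # flood ping; check the loss figure in the summary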
* check OS-level resource limits (ulimit -a on POSIX systems). In one
installation we were experiencing weird crashes and finally
discovered that datanodes and tasktrackers were hitting the OS-wide
limit on open file handles. In another installation the OS-wide
limits were OK, but the limits on that particular account were
insufficient.
/var/log/messages is your friend ;). I think many people don't realize
when getting into search engines that it is as much about hardware and
system knowledge as it is about software.
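
For anyone who wants to check for the same thing, something along
these lines should show how close a datanode or tasktracker is to the
limit, and how to raise it. The 'hadoop' user name and 65536 are just
example values, not numbers from our cluster:

# per-process limit for the current shell, and actual fds held by the datanode
ulimit -n
ls /proc/$(pgrep -f DataNode)/fd | wc -l    # assumes a single DataNode process

# raise the per-user limit in /etc/security/limits.conf (example values)
hadoop  soft  nofile  65536
hadoop  hard  nofile  65536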
Dennis Kubes