Dennis Kubes wrote:
So we moved 50 machines to a data center for a beta cluster of a new
search engine based on Nutch and Hadoop. We fired up all of the
machines and started fetching, and almost immediately began
experiencing JVM crashes and checksum/IO errors which caused jobs to
fail, tasks to fail, and random data corruption. After digging through
and fixing the problems, we came up with some observations that may
seem obvious but may also help someone else avoid the same problems.
[..]
Thanks, Dennis, for sharing this - it's very useful.
I could also add the following from my experience: for medium-to-large
scale crawling, i.e. on the order of 20-100 million pages, be prepared
to address the following issues:
* take a crash course in advanced DNS setup ;) I found that often the
bottleneck lies in DNS and not in the raw bandwidth limits. If your
fetchlist consists of many unique hosts, then Nutch will fire thousands
of DNS requests per second. Using just an ordinary setup, i.e. without
caching, is pointless (most of the time the lookups will time out) and
harmful to the target DNS servers. You have to use a caching DNS
resolver - I have had good experiences with djbdns / tinydns, but it
also requires careful tuning of the maximum number of requests, the
cache size, ignoring overly short TTLs, etc. (see the first sketch
after this list).
* check your network infrastructure. I had a few cases of clusters that
were giving sub-standard performance, only to find that e.g. cables were
flaky. In most cases, though, it's the network equipment such as
switches and routers - check their CPU usage and the number of dropped
packets. On some entry-level switches and routers, even though the
interfaces nominally support gigabit speeds, the switching fabric and/or
CPU can't sustain high packet rates - they peg at 100% CPU, and even if
they don't report any lost packets, a 'ping -f' shows they can't handle
the load (a crude packet-rate probe is sketched after this list).
* check OS-level resource limits (ulimit -a on POSIX systems). In one
installation we were experiencing weird crashes and finally discovered
that datanodes and tasktrackers were hitting the OS-wide limit on open
file handles. In another installation the OS-wide limits were fine, but
the limits on the particular account running Hadoop were insufficient (a
small JVM-side check is sketched after this list).
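
To make the DNS point above more concrete, here is a minimal sketch (in
Java, since that is what Nutch runs on) that times forward lookups for a
sample of hosts taken from a fetchlist - run it once against the stock
resolver and once against the caching resolver, and compare. This is not
part of Nutch; the class name, the hosts.txt input file and the 2-second
"slow" threshold are just assumptions for illustration.

// DnsProbe.java - hypothetical helper, not part of Nutch.
// Times one forward lookup per host; lookups go through the system
// resolver (/etc/resolv.conf), so a local caching resolver will be
// reflected in the numbers.
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsProbe {
    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "hosts.txt"; // one hostname per line
        long slow = 0, failed = 0, total = 0, totalMillis = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String host;
            while ((host = in.readLine()) != null) {
                host = host.trim();
                if (host.isEmpty()) continue;
                total++;
                long start = System.currentTimeMillis();
                try {
                    InetAddress.getByName(host);           // forward lookup
                } catch (UnknownHostException e) {
                    failed++;                              // NXDOMAIN or timeout
                }
                long elapsed = System.currentTimeMillis() - start;
                totalMillis += elapsed;
                if (elapsed > 2000) slow++;                // arbitrary 2s threshold
            }
        }
        System.out.printf("lookups=%d failed=%d slow(>2s)=%d avg=%dms%n",
            total, failed, slow, total == 0 ? 0 : totalMillis / total);
    }
}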
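
For the network point: 'ping -f' typically needs root and only exercises
ICMP, so here is a crude UDP packet-rate probe in the same spirit -
again purely illustrative, with an arbitrary port, packet size and
count. Run it with "recv" on one node and with "send <receiver-host>"
on another; if the switch or router cannot sustain the small-packet
rate, the receiver will count far fewer packets than were sent (keeping
in mind that some UDP loss can also come from the hosts themselves).

// PacketRateProbe.java - hypothetical helper, not a standard tool.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

public class PacketRateProbe {
    static final int PORT = 9999;       // arbitrary
    static final int COUNT = 100000;    // packets to send
    static final int SIZE = 64;         // small packets stress packet rate, not bandwidth

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("recv")) receive();
        else send(args.length > 1 ? args[1] : "localhost");
    }

    static void send(String host) throws Exception {
        byte[] buf = new byte[SIZE];
        InetAddress addr = InetAddress.getByName(host);
        try (DatagramSocket sock = new DatagramSocket()) {
            long start = System.nanoTime();
            for (int i = 0; i < COUNT; i++) {
                sock.send(new DatagramPacket(buf, buf.length, addr, PORT));
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("sent %d packets in %.2fs (%.0f pkts/s)%n",
                COUNT, secs, COUNT / secs);
        }
    }

    static void receive() throws Exception {
        byte[] buf = new byte[SIZE];
        int received = 0;
        try (DatagramSocket sock = new DatagramSocket(PORT)) {
            sock.setSoTimeout(5000);    // stop 5s after packets stop arriving
            try {
                while (true) {
                    sock.receive(new DatagramPacket(buf, buf.length));
                    received++;
                }
            } catch (SocketTimeoutException done) {
                // sender finished (or packets stopped getting through)
            }
        }
        System.out.printf("received %d packets - compare with the sender's count%n",
            received);
    }
}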
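
And for the resource limits: the sketch below reports a running JVM's
open-file-descriptor usage against its per-process limit, using the
HotSpot-specific com.sun.management.UnixOperatingSystemMXBean (so it
will not compile or run on every JVM). Raising the limits themselves is
still done outside Java - 'ulimit -n' / limits.conf for the account
running Hadoop, and fs.file-max for the OS-wide limit on Linux - this
only makes the problem visible before datanodes and tasktrackers start
failing in strange ways.

// FdCheck.java - hypothetical helper; relies on a HotSpot-specific API.
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max  = unix.getMaxFileDescriptorCount();
            System.out.printf("open file descriptors: %d of %d (%.1f%%)%n",
                open, max, 100.0 * open / max);
            if (open > max * 0.8) {
                System.out.println("WARNING: close to the per-process fd limit"
                    + " - raise 'ulimit -n' for this account");
            }
        } else {
            System.out.println("not a Unix HotSpot JVM - check ulimit -n manually");
        }
    }
}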
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com