On Dec 9, 2008, at 4:58 PM, Edward Capriolo wrote:
Also, it might be useful to word hadoop-default.conf more strongly, as many people might not know a downside exists to using 2 rather than 3 as the replication factor. Before reading this thread I would have thought 2 to be sufficient.
I think 2 should be sufficient, but running with 2 replicas instead of 3 exposes some namenode bugs that are otherwise harder to trigger.
For example, let's say your system has 100 nodes and 1M blocks, and a namenode bug silently affects the replica of block X on node Y. Block X now has only one good copy left, so when another node goes down there is a 1% chance (1 in 100) that it is the node holding that copy, and the block becomes missing. If this bug is cumulative or affects many blocks (I suspect about 500-1000 blocks out of 1M are problematic), you're almost guaranteed to lose data whenever a single node goes down.
On the other hand, if you have 1000 block replica problems on the same cluster with 3 replicas, then in order to lose a file, two of the replica problems must affect the same block and the node that goes down must hold the third copy. The probability of this happening is (1e-6) * (1e-6) * (1/100) = 1e-14, or 0.000000000001%.
So, even assuming that I did all my probability calculations wrong, a site running with 2 replicas is more than 10 orders of magnitude more likely to discover inconsistencies or other bugs in the namenode than a site with 3 replicas.
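To make the comparison concrete, here is a quick sketch of the arithmetic above using the message's own rough numbers (100 nodes, 1M blocks, ~1000 undetected bad replicas); the exact figures are the author's estimates, not measurements:

```python
# Back-of-envelope check of the 2-vs-3 replica argument, using the rough
# numbers from the message above (assumptions, not measured HDFS data).
nodes = 100
blocks = 1_000_000
bad = 1_000  # block replicas silently affected by a namenode bug

# Replication 2: each bad block has one good copy left, lost with
# probability 1/100 when a random node fails. With ~1000 such blocks,
# the chance a single node failure loses *no* data is tiny:
p_safe_2 = (1 - 1 / nodes) ** bad
print(f"replication 2: P(no loss on one node failure) ~ {p_safe_2:.1e}")

# Replication 3, per the author's estimate: two bad replicas must hit the
# same block AND the failed node must hold the third copy.
p_loss_3 = 1e-6 * 1e-6 * (1 / nodes)
print(f"replication 3: P(loss) = {p_loss_3:.0e} ({p_loss_3 * 100:.0e} %)")
```

Running this gives roughly a 4e-5 chance of surviving a node failure unscathed at replication 2, versus a 1e-14 chance of any loss at replication 3, which is where the "more than 10 orders of magnitude" gap comes from.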
Accordingly, these sites are the "canaries in the coal mine" to discover NameNode bugs.
Brian
