We are continuing to see a small, consistent amount of block corruption leading to file loss. We have been upgrading our cluster lately, which means we've been doing a rolling decommissioning of our nodes (and then adding them back with more disks!).

Previously, when I've had time to investigate this very deeply, I've found issues like these:

https://issues.apache.org/jira/browse/HADOOP-4692
https://issues.apache.org/jira/browse/HADOOP-4543

I suspect that these issues cause some or all of our problems.

I also saw that one of our nodes was reported as 100.2% full; I think this is due to the same issue: Hadoop's recorded usage of the file system ends up greater than the node's capacity because some of the blocks were truncated.

I'd have to check with our sysadmins, but I think we've lost about 200-300 files during the upgrade process. Right now, there are about 900 chronically under-replicated blocks; in the past, that has meant the only remaining replica is itself corrupt, and Hadoop keeps trying to re-replicate it, fails every time, and never realizes the source replica is the problem. To some extent, this whole issue arises because we only have enough space for 2 replicas; I'd imagine that at 3 replicas, the issue would be much harder to trigger.
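To make the under-replication point concrete, here's a rough, untested sketch (the class name and the starting path are just placeholders for our setup) of how one could walk the namespace with the Hadoop FileSystem API and flag files whose blocks report fewer live locations than the requested replication factor -- those are the candidates for the chronically under-replicated, possibly corrupt-only-replica cases above:

    // Rough sketch: recursively walk an HDFS path and report files whose
    // blocks have fewer live replica locations than the file asked for.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UnderReplicatedReport {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        walk(fs, new Path(args.length > 0 ? args[0] : "/"));
      }

      static void walk(FileSystem fs, Path p) throws Exception {
        FileStatus[] entries = fs.listStatus(p);
        if (entries == null) return;
        for (FileStatus stat : entries) {
          if (stat.isDir()) {           // recurse into directories
            walk(fs, stat.getPath());
            continue;
          }
          BlockLocation[] blocks =
              fs.getFileBlockLocations(stat, 0, stat.getLen());
          for (BlockLocation b : blocks) {
            int live = b.getHosts().length;
            if (live < stat.getReplication()) {
              System.out.println(stat.getPath() + " @" + b.getOffset()
                  + ": " + live + "/" + stat.getReplication()
                  + " replica locations reported");
            }
          }
        }
      }
    }

This won't tell you whether the surviving replica is itself corrupt (only actually reading it and hitting a checksum error would), but it at least gives a file-level list to keep an eye on.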

Any suggestions? For us, file loss is something we can deal with (not necessarily fun, of course), but that might not be the case in the future.

Brian
