> The DFS is stored in /tmp on each box.
> The developers who own the machines occasionally reboot and reprofile them
Won't you lose your blocks after a reboot, since /tmp gets cleaned up? Could this be the reason you are seeing data corruption? A good idea would be to configure DFS to live anywhere other than /tmp (a rough example is sketched after the quoted message below).

Thanks,
Lohit

----- Original Message ----
From: Jeff Eastman <[EMAIL PROTECTED]>
To: hadoop-user@lucene.apache.org
Sent: Wednesday, January 16, 2008 9:32:41 AM
Subject: Platform reliability with Hadoop

I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen machines in our CUBiT array for the last month. During this time I have experienced two major data corruption losses on relatively small amounts of data (<50 GB), which makes me wonder about the suitability of this platform for hosting Hadoop.

CUBiT is one of our products for managing a pool of development servers; it lets developers check out machines, install various OS profiles on them, and monitor their utilization via the web. With most machines reporting very low utilization, it seemed a natural place to run Hadoop in the background. I have an NFS-mounted account on all of the machines and have installed Hadoop there. The DFS is stored in /tmp on each box. The developers who own the machines occasionally reboot and reprofile them, but this occurs infrequently and does not clobber /tmp. Hadoop is designed to deal with slave failures of this nature, though this platform may well be an acid test.

My initial cloud was configured with a replication factor of 3, and I have now increased that to 4 in hopes of improving data reliability in the face of these more prevalent slave outages. Ted Dunning has suggested aggressive rebalancing in his recent posts, and I have done this by increasing replication to 5 (from 3) and then dropping it to 4. Are there other rebalancing or configuration techniques that might improve my data reliability? Or is this platform just too unstable to be a good fit for Hadoop?

Jeff
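
The suggestion above about moving DFS off /tmp would be applied in conf/hadoop-site.xml on each node, which overrides the defaults in hadoop-default.xml. The sketch below is only an illustration, not a configuration taken from this thread; the /var/hadoop/... paths are placeholders and would need to exist, with enough space and the right permissions, on every machine.

<!-- conf/hadoop-site.xml: keep DFS data out of /tmp so reboots and reprofiles do not wipe blocks -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/tmp</value>
    <!-- Base for Hadoop's temporary directories; the default lives under /tmp. -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/hadoop/dfs/data</value>
    <!-- Where each datanode stores its blocks. -->
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/hadoop/dfs/name</value>
    <!-- Where the namenode keeps the filesystem image (matters on the namenode host). -->
  </property>
</configuration>

After changing these, DFS would need a restart, and existing blocks would have to be copied over or re-replicated into the new directories. The replication-bump rebalancing Jeff describes corresponds to something like "bin/hadoop dfs -setrep -R 5 /" followed by "bin/hadoop dfs -setrep -R 4 /".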