Ted Dunning wrote:
I haven't done a detailed comparison, but I have seen some effects:
A) RAID doesn't usually work very well on low-end machines compared to independent drives. This would make me distrust RAID.
B) Hadoop has historically not done very well with more than one partition if the partitions are not roughly equal in size. Quite frankly, it doesn't even do all that well with datanodes that have radically different storage availability.
C) With RAID-0, if you lose either drive, you lose both. With separate partitions, you can lose one drive and retain the other.
These lead to opposite conclusions, so I don't know what to recommend. If I had to choose, I think I would do without RAID.
D) Out of 5 disks, if one of them is slow (not that uncommon), then the whole RAID array will run only as fast as that disk.
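To put rough numbers on point D, here is a minimal sketch of a deliberately simplified model: it assumes striped reads proceed in lockstep, so every disk in the array is held to the pace of the slowest one, while independent disks each run at their own speed. The function names and the per-disk speeds are made up for illustration, not measurements.

```python
def raid0_throughput(disk_speeds_mb_s):
    """Aggregate MB/s of a striped array under the lockstep assumption:
    each stripe completes only when the slowest disk has finished."""
    return len(disk_speeds_mb_s) * min(disk_speeds_mb_s)


def independent_throughput(disk_speeds_mb_s):
    """Aggregate MB/s when each disk is used on its own, assuming
    enough parallel work to keep all of them busy."""
    return sum(disk_speeds_mb_s)


# Hypothetical node: four healthy disks and one running at half speed.
speeds = [80, 80, 80, 80, 40]
print(raid0_throughput(speeds))        # 200
print(independent_throughput(speeds))  # 360
```

Under these assumptions the single slow disk costs the striped array far more than it costs the independent layout. Real arrays will behave less cleanly than this.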
On a smaller cluster (<100 nodes), RAID is simpler overall, since the probability of running into bad disks is lower.
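A back-of-the-envelope sketch of the cluster-size argument, together with point C about RAID-0 losing everything when any one drive fails. It assumes disks fail independently with the same probability over some period; the 3% figure below is a placeholder, not a measured failure rate, and the function names are made up for illustration.

```python
def p_raid0_volume_lost(p_disk, disks_per_node):
    """Probability that a RAID-0 volume is lost: it fails if ANY of its
    disks fails (assumes independent, identical disks)."""
    return 1 - (1 - p_disk) ** disks_per_node


def expected_failed_disks(nodes, disks_per_node, p_disk):
    """Expected number of failed disks across the whole cluster."""
    return nodes * disks_per_node * p_disk


P_DISK = 0.03  # assumed per-disk failure probability over the period

for nodes in (20, 100, 1000):
    print(nodes, "nodes:",
          "expected failed disks =", expected_failed_disks(nodes, 5, P_DISK),
          "| P(node loses RAID-0 volume) =",
          round(p_raid0_volume_lost(P_DISK, 5), 3))
```

The per-node risk does not depend on cluster size, but the expected number of failures to deal with grows linearly with the number of nodes, which is why the trade-off looks different on a small cluster.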
Raghu.