Are there any published guidelines on system configuration for Hadoop? I've seen hardware suggestions, but I'm mainly interested in recommendations on disk layout and partitioning.

The defaults shipped in hadoop-default.xml may be fine for testing, but they are not appropriate for sustained use. For example, DFS data and metadata are both stored under /tmp. In typical use on a cluster with a couple hundred nodes, the NameNode can generate 3-5 GB of logs per day. If the NameNode host is configured badly, it's easy to fill up the partition that DFS uses for metadata and clobber the DFS filesystem. I would think that setting the log threshold to WARN rather than INFO would be preferable.
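
To make the concern concrete, this is roughly the kind of override I have in mind for hadoop-site.xml. It is only a sketch: the property names (hadoop.tmp.dir, dfs.name.dir) are the ones I understand hadoop-default.xml to define, and the paths are hypothetical mount points I made up for illustration, not a recommendation.

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml: site-specific overrides of hadoop-default.xml -->
<configuration>
  <!-- Base directory for Hadoop's temporary files, moved out of /tmp.
       /hadoop/tmp is a hypothetical path. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/tmp</value>
  </property>
  <!-- NameNode metadata on its own partition, so that logs or temp files
       cannot fill it. /hadoop/name is a hypothetical path. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop/name</value>
  </property>
</configuration>
```

The log directory would still have to be pointed at a separate partition on its own, which I believe is done through HADOOP_LOG_DIR in hadoop-env.sh rather than in this file.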
On a DataNode, we would like to reserve as much space as we can for data, but we know that map-reduce jobs need some local storage. How do people generally estimate the amount of space required for temporary storage? I assume it would be good to put temp space on a partition separate from data storage, to prevent running out of temp space on some nodes. I would also think it preferable for performance to put temp space on a different spindle, so that it and HDFS data can be accessed independently.

I would be interested to know how other sites configure their systems, and I would love to see some guidelines for system configuration for Hadoop.

Thank you!
David
