On 04/22/2011 09:48 AM, Suresh Srinivas wrote:
> A few weeks ago, I had sent an email about the progress of HDFS
> federation development in HDFS-1052 branch. I am happy to announce
> that all the tasks related to this feature development is complete
> and it is ready to be integrated into trunk.

A couple of questions:

1. Can you please describe the significant advantages this approach has over a symlink-based approach? It seems to me that one could run multiple namenodes on separate boxes and run multiple datanode processes per storage box, configured with something like:

   first datanode process configuration
     fs.default.name = hdfs://nn1/
     dfs.data.dir = /drive1/nn1/,/drive2/nn1/...

   second datanode process configuration
     fs.default.name = hdfs://nn2/
     dfs.data.dir = /drive1/nn2/,/drive2/nn2/...

   ...

   Then symlinks could be used between nn1, nn2, etc. to provide a reasonably unified namespace. From the benefits listed in the design document, it is not clear to me what the substantial benefits are over such a configuration.

2. How much testing has been performed on this? The patch modifies much of the logic of Hadoop's central component, upon which the performance and reliability of most other components of the ecosystem depend. It seems to me that such an invasive change should be well tested before it is merged to trunk. Can you please tell me how this has been tested beyond unit tests?

Thanks!

Doug
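
P.S. To make the layout in question 1 concrete, here is a rough Python sketch that builds the per-process settings for one storage box. It is only an illustration, not a tested deployment: the helper name `datanode_configs` is hypothetical, the namenode names and drive paths are the placeholders used above, and it assumes one datanode process per namenode with a dedicated subdirectory on every drive.

```python
def datanode_configs(namenodes, drives):
    """Build one settings dict per datanode process (one process per namenode).

    Hypothetical helper for illustration only; it covers just the two
    properties mentioned in the message above.
    """
    configs = []
    for nn in namenodes:
        configs.append({
            # Each datanode process reports to exactly one namenode.
            "fs.default.name": f"hdfs://{nn}/",
            # Each process gets its own subdirectory on every drive.
            "dfs.data.dir": ",".join(f"{drive}/{nn}/" for drive in drives),
        })
    return configs


configs = datanode_configs(["nn1", "nn2"], ["/drive1", "/drive2"])
for number, config in enumerate(configs, start=1):
    print(f"datanode process {number}")
    for key, value in config.items():
        print(f"  {key} = {value}")
```

Several datanode processes sharing one box would presumably also need distinct listening ports, which this sketch leaves out.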