Konstantin Shvachko wrote:
> 200 bytes per file is theoretically correct, but rather optimistic :-(
> From real-system memory utilization I can see that HDFS uses 1.5-2K per
> file. And since each file is internally represented by two files
> (1 real + 1 crc), the real estimate per file should be 3-4K.
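For concreteness, here is a rough back-of-envelope sketch of what those
numbers imply for NameNode heap. The 1.5-2K per namespace entry and the
doubling for the .crc companion are the figures above; the file count
and class name are made-up examples, not measurements:

// Back-of-envelope NameNode heap estimate, assuming ~2K of heap per
// namespace entry and a .crc companion that doubles the entry count.
// The 10M file count is an arbitrary example, not a figure from this thread.
public class NamenodeHeapEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;     // hypothetical file count
        long bytesPerEntry = 2_000L;  // observed upper bound per namespace entry
        long entriesPerFile = 2L;     // 1 data file + 1 .crc companion

        long heapBytes = files * bytesPerEntry * entriesPerFile;
        System.out.printf("~%.1f GB of NameNode heap for %,d files%n",
                heapBytes / 1e9, files);
    }
}

Under those assumptions, 10 million files works out to roughly 40 GB of
NameNode heap.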
But also note that there are plans to address these issues over the
coming months. For a start:
https://issues.apache.org/jira/browse/HADOOP-803
https://issues.apache.org/jira/browse/HADOOP-928
Once checksums are optional, we can replace their implementation in
HDFS with something that does not consume namespace.
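To illustrate the kind of replacement meant here (a sketch only, not the
design in HADOOP-928), checksums could be computed per fixed-size chunk
inline with the data and stored by the datanode alongside the block
replica, so no separate .crc file appears in the namespace. The chunk
size, class, and method names below are assumptions for illustration:

import java.util.zip.CRC32;

// Illustrative only: per-chunk CRC32 checksums computed inline with the
// data, of the sort a datanode could keep next to the block replica
// instead of HDFS tracking a separate .crc file in the namespace.
public class ChunkChecksums {
    static final int CHUNK_SIZE = 512;  // assumed checksum chunk size

    /** Returns one CRC32 value per CHUNK_SIZE slice of the data. */
    static long[] checksumChunks(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];           // example payload
        long[] sums = checksumChunks(data);
        System.out.println(sums.length + " chunk checksums computed");
    }
}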
Long term, we hope to approach ~100 bytes per file.
Doug