Konstantin Shvachko wrote:
200 bytes per file is theoretically correct, but rather optimistic :-(
Looking at memory utilization on a real system, I can see that HDFS uses 1.5-2K per file. And since each real file is internally represented by two files (1 data + 1 crc), the real estimate per file should be 3-4K.

But also note that there are plans to address this over the coming months. For a start:

https://issues.apache.org/jira/browse/HADOOP-803
https://issues.apache.org/jira/browse/HADOOP-928

Once checksums are optional, we can replace their implementation in HDFS with something that does not consume namespace.
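
For context, the namespace cost comes from the client-side checksum layer keeping a hidden side-car file next to every data file, so the namenode ends up tracking two entries per logical file. Here is a minimal sketch of that convention (the ".<name>.crc" naming and the class below are illustrative assumptions, not the actual HDFS code):

    import org.apache.hadoop.fs.Path;

    public class CrcSibling {
        // The checksum layer keeps a hidden ".<name>.crc" companion next to
        // each data file, so every logical file costs two namespace entries.
        static Path checksumFileFor(Path file) {
            return new Path(file.getParent(), "." + file.getName() + ".crc");
        }

        public static void main(String[] args) {
            Path data = new Path("/user/doug/part-00000");
            System.out.println(checksumFileFor(data));
            // prints: /user/doug/.part-00000.crc
        }
    }

Moving checksums out of the namespace removes that second entry, which by the numbers above roughly halves the per-file memory cost.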

Long term we hope to approach ~100 bytes per file.
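
To put those per-file figures in heap terms, here is a rough back-of-the-envelope sketch (the 10-million-file namespace is an arbitrary illustration, not a measured figure):

    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long files = 10_000_000L;      // hypothetical namespace size
            long perFileToday = 4096L;     // ~3-4K observed (data entry + crc entry)
            long perFileTarget = 100L;     // long-term goal mentioned above

            long todayMB = files * perFileToday / (1024 * 1024);
            long targetMB = files * perFileTarget / (1024 * 1024);
            System.out.println("today:  ~" + todayMB + " MB of namenode heap");
            System.out.println("target: ~" + targetMB + " MB of namenode heap");
            // Roughly 40 GB today versus about 1 GB at 100 bytes per file.
        }
    }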

Doug
