Your interest is good. I think you should ask even fewer questions in each mail and try to do more experimentation yourself.
Bwolen Yang wrote:
Here is a summary of my remaining questions from the [write and sort performance] thread. - It looks like for every 5GB of data I put into Hadoop DFS, ~18GB of raw disk space is used (based on block counts exported from the namenode). Accounting for 3x replication, I was expecting 15GB. What's causing this 20% overhead?
You are assuming each block is 64MB; a file's last block is usually only partially full, so multiplying block counts by 64MB overestimates the space used. There are also some blocks for the "CRC files". Did you try running du on the datanodes' 'data directories'?
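For what it's worth, here is a minimal sketch (assuming a Hadoop client with fs.default.name pointing at your namenode, and a non-recursive listing of one directory passed as the first argument) that compares the block-count estimate against the actual file lengths the namenode reports. Partial last blocks are usually where the "missing" space goes.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockAccounting {
    public static void main(String[] args) throws Exception {
      // Picks up fs.default.name from the config files on the classpath.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      long actual = 0;     // sum of real file lengths * replication
      long estimated = 0;  // whole blocks * block size * replication

      for (FileStatus st : fs.listStatus(new Path(args[0]))) {
        if (st.isDir()) continue;
        // Number of blocks, rounding the last (partial) block up.
        long blocks = (st.getLen() + st.getBlockSize() - 1) / st.getBlockSize();
        actual    += st.getLen() * st.getReplication();
        estimated += blocks * st.getBlockSize() * st.getReplication();
      }
      System.out.println("actual bytes (x replication): " + actual);
      System.out.println("block-count estimate:         " + estimated);
    }
  }

The gap between the two numbers is the part of your 20% that is just rounding up to full blocks; the rest should show up as CRC data when you du the datanode directories.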
- When a large amount of data is written to HDFS (for example with copyFromLocal), is the block replication pipelined? Also, does one 64MB block need to be fully replicated before the copy of the next 64MB block can start?
They are pipelined. Again, you can experiment by trying a single replica (set in the config) and seeing whether the copy runs much faster. If it does not, then the replicas should be pipelined.
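A rough sketch of that experiment (the key is the dfs.replication setting; the source and destination paths are just placeholders taken from the command line):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReplicationTiming {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Ask for a single replica for files written by this client.
      conf.set("dfs.replication", "1");
      FileSystem fs = FileSystem.get(conf);

      Path src = new Path(args[0]);   // local file to copy
      Path dst = new Path(args[1]);   // destination in DFS

      long start = System.currentTimeMillis();
      fs.copyFromLocalFile(src, dst); // same as bin/hadoop dfs -copyFromLocal
      System.out.println("copy took " + (System.currentTimeMillis() - start) + " ms");
    }
  }

Run it once with the line setting dfs.replication to 1 and once without (i.e. with the default of 3). If the single-replica copy is not much faster, the extra replicas are being written in a pipeline rather than one after another.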
Raghu.
