Here is a summary of my remaining questions from the [write and sort performance] thread.
- It looks like for every 5GB of data I put into Hadoop DFS, ~18GB of raw disk space gets used (based on block counts exported from the namenode). Accounting for 3x replication, I was expecting 15GB. What's causing this 20% overhead?
- When a large amount of data is written to HDFS (for example, via copyFromLocal), is the replication of file blocks pipelined? Also, does one 64MB block need to be fully replicated before the copy of the next 64MB block can start?
- Is there a way to control how many mappers are actively running at a time? I.e., I would like to try matching the number of running mappers to the number of slaves so I can see an individual mapper's performance without interference. (See the sketch at the end of this mail for what I had in mind.)
- Is there a way to force each mapper to process only 64MB of data? Some were processing 67MB during a sort.
- What is the file access pattern for a mapper when its data is local? I sort of expect it to read 1 local 64MB file and possibly write out R local files, each with 64/R MB worth of data, where R is the number of reducers. Is this wrong? I haven't seen a mapper task run anywhere close to this fast. (The shuffle question probably shares some answers with the mapper question... so I'll omit it for now.)
- I did the copyFromLocal test (dd | bin/hadoop dfs -copyFromLocal) suggested by Raghu. Both 1GB tests show 9.2MB/sec (for a 2GB copy, it is around 8.3MB/sec). This is consistent with the earlier random writer result (10.4MB/sec). So, it is only around 11-14% of raw disk performance.

bwolen
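
PS: To clarify what I had in mind for the mapper questions above, here is a
rough sketch of what I was planning to try. This assumes the
org.apache.hadoop.mapred.JobConf API and the property names I see in my
hadoop-default.xml, so please treat the specific names (and the class/method
names, which are just my own scaffolding) as assumptions rather than something
I know to be the right way to do this.

    import org.apache.hadoop.mapred.JobConf;

    public class MapperExperiment {
      public static void configure(JobConf conf, int numSlaves) {
        // Hint: aim for one map task per slave. As far as I can tell this
        // is only a hint; the actual number of maps still follows from the
        // input splits.
        conf.setNumMapTasks(numSlaves);

        // Pin the minimum split size to the 64MB block size (67108864
        // bytes), in the hope that each split comes out at one block.
        conf.set("mapred.min.split.size", "67108864");

        // The number of tasks running concurrently on each slave appears to
        // be a tasktracker-side setting ("mapred.tasktracker.tasks.maximum"
        // in my hadoop-default.xml), so presumably it has to go into each
        // slave's hadoop-site.xml with the tasktrackers restarted, rather
        // than into the JobConf.
      }
    }

If there is a better-supported way to get one 64MB split per mapper and one
running mapper per slave, that is really what I am after.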
