Here is a summary of my remaining questions from the [write and sort
performance] thread.

- It looks like every 5GB of data I put into Hadoop DFS uses up ~18GB
of raw disk space (based on block counts exported from the namenode).
Accounting for 3x replication, I was expecting 15GB.  What's causing
this ~20% overhead?
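
Spelling out the arithmetic, plus a crude way I can double-check the
raw usage directly on a datanode (the dfs.data.dir path below is just
a placeholder for whatever hadoop-site.xml points at on my nodes):

    # expected: 5GB x 3 replicas = 15GB; observed ~18GB
    #   18 / 15 = 1.2  ->  ~20% more raw disk than replication alone explains
    #   18 / 5  = 3.6  ->  effective replication of ~3.6x instead of 3x
    du -sh /path/to/dfs/data                  # raw bytes this datanode holds
    find /path/to/dfs/data -type f | wc -l    # number of block files stored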

- When a large amount of data is written to DFS (for example via
copyFromLocal), is the file block replication pipelined?  Also, does
one 64MB block need to be fully replicated before the copy of the next
64MB block can start?
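
A rough way I could probe this myself: write the same file with
replication 1 and then with replication 3 and compare the wall-clock
times.  This assumes dfs.replication in hadoop-site.xml is the right
knob and that it applies to files written after the change; the paths
are arbitrary:

    dd if=/dev/zero of=/tmp/1gb bs=1M count=1024
    # with dfs.replication = 1 in conf/hadoop-site.xml:
    time bin/hadoop dfs -copyFromLocal /tmp/1gb /bench/1gb-rep1
    # with dfs.replication = 3:
    time bin/hadoop dfs -copyFromLocal /tmp/1gb /bench/1gb-rep3
    # if replication is pipelined, the rep=3 copy should be only modestly
    # slower than rep=1; if each block is replicated serially before the
    # next block is written, it should be closer to 3x slower.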

- Is there a way to control how many mappers are actively running at a
time?  I.e., I would like to try matching the number of concurrently
running mappers to the number of slaves, so I can see each mapper's
performance without interference.
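
To make concrete what I'm after, this is the kind of hadoop-site.xml
fragment I was hoping exists (the property name is a guess on my part,
so please correct me if the real knob is something else):

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>1</value>
    </property>

With that on each slave, I'd expect the jobtracker web UI to show at
most one running map task per slave.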

- Is there a way to force each mapper to process only 64MB of data?
Some mappers were processing 67MB during a sort.
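
The only workaround I've come up with is to make the requested number
of maps equal to the number of 64MB blocks in the input, so each split
works out to roughly one block.  A sketch, assuming mapred.map.tasks
is honored as a hint and splits are computed as input size divided by
the number of maps:

    INPUT_BYTES=5368709120                          # e.g. a 5GB input
    BLOCK=67108864                                  # 64MB dfs.block.size
    NMAPS=$(( (INPUT_BYTES + BLOCK - 1) / BLOCK ))
    echo $NMAPS                                     # pass this as mapred.map.tasks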

- What's the file access pattern for a mapper when the data is local?
I sort of expect it to read one local 64MB block and possibly write
out R local files, each with 64/R MB worth of data, where R is the
number of reducers.  Is this wrong?  I haven't seen a mapper task run
anywhere near as fast as that pattern would suggest.
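
To make "fast" concrete, my back-of-envelope estimate under that
assumed access pattern (the ~70MB/sec sequential disk rate is just the
rough figure implied by the 11-14% numbers further down):

    # per map task, under the assumed pattern:
    #   read  : 1 x 64MB local block
    #   write : R files totalling ~64MB (64/R MB each)
    # at ~70MB/sec sequential, that is roughly
    #   (64 + 64) / 70  ~=  2 seconds of pure disk I/O per map task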

(The shuffle question probably shares some answers with the mapper
question, so I'll omit it for now.)

- I did the copyFromLocal test (dd | bin/hadoop dfs -copyFromLocal)
suggested by Raghu.  Both 1GB tests show a throughput of 9.2MB/sec
(for a 2GB copy, it is around 8.3MB/sec).  This is consistent with the
earlier random writer result (10.4MB/sec).  So it is only around
11-14% of raw disk performance.
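
Spelling out the arithmetic behind that range (the raw-disk rate below
is inferred back from the percentages themselves, so it is only
approximate):

    # measured DFS write rates: 8.3 - 10.4 MB/sec
    # the 11-14% range corresponds to a raw local-disk rate of ~75MB/sec:
    #   8.3  / 75  ~= 11%
    #   10.4 / 75  ~= 14%
    # and at 9.2MB/sec, the 1GB copy takes about 1024 / 9.2 ~= 111 seconds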

bwolen
