It definitely sounds interesting, kind of like gridmix.  I think that there are 
three big issues here.

  1. Where are you going to store all of the data, or are you just going to 
generate random data?  If it is random data then you can do this almost totally 
from an anonymised version of the audit logs (you would need something to 
record the lengths of the writes/reads, probably on the DataNodes themselves).
  2. How are you going to deal with multiple machines and network saturation?  
A typical Hadoop cluster is going to have accesses from many many different 
machines and in aggregate is likely to saturate the network connection from any 
single box.  You will need a way to replay this from many different machines, 
probably all machines in the cluster, preferably in at least a slightly 
coordinated way.
  3.  I assume that for most clusters HDFS is primarily accessed from within 
that cluster, by MapReduce jobs.  Yes, there is a lot of HBase too, and there 
are probably similar problems with that.  The JobTracker/ResourceManager tries 
very hard to put jobs close to the data: the same rack most of the time, and 
the same node some of the time.  Because HDFS is not deterministic in how it 
assigns blocks, we are likely to see very different performance characteristics 
with respect to the locality of accesses when replaying a log than we saw on 
the original.  I don't think that this is super critical, but if this is not 
addressed and we optimize for these benchmarks, we are likely going to optimize 
more for remote accesses than a typical cluster sees.
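
To make points 1 and 2 a bit more concrete, here is a rough sketch of how an 
anonymised trace might be split across several replay machines.  Everything in 
it is an assumption for illustration: the TraceOp record layout, the salt, and 
the helper names are hypothetical, and the stock audit log would still need to 
be joined with per-operation lengths recorded elsewhere.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TraceOp(NamedTuple):
    """One replayable operation (hypothetical layout, not an HDFS format)."""
    offset_ms: int  # time since the start of the trace
    client: str     # anonymised client identifier
    op: str         # e.g. "read" or "write"
    path: str       # anonymised path
    length: int     # bytes read or written; replay can use random data


def anonymize(value: str, salt: str = "example-salt") -> str:
    """Replace a client address or path with a stable salted hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def assign_replayer(client: str, num_replayers: int) -> int:
    """Map a client to a replay machine, stable across runs and processes."""
    digest = hashlib.sha256(client.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_replayers


def shard_trace(ops: Iterable[TraceOp],
                num_replayers: int) -> Dict[int, List[TraceOp]]:
    """Group operations by replay machine, keeping each shard in time order."""
    shards: Dict[int, List[TraceOp]] = defaultdict(list)
    for op in ops:
        shards[assign_replayer(op.client, num_replayers)].append(op)
    for shard in shards.values():
        shard.sort(key=lambda o: o.offset_ms)
    return dict(shards)
```

Keeping all of one client's operations on one replayer preserves the per-client 
ordering, and the shared offset_ms gives the replayers a common clock to 
coordinate against.  It does nothing about point 3: block placement, and so 
locality, would still differ from the original run.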

I think it is a great idea; it is just going to be a lot of work to get it 
right.

--Bobby Evans

On 4/28/12 8:15 PM, "Colin McCabe" <cmcc...@alumni.cmu.edu> wrote:

Here is an interesting idea: recording traces of the filesystem
operations applications do, and allowing these traces to be replayed
later.

> ioreplay is mainly intended for replaying of recorded (using strace) IO
> traces, which is useful for standalone benchmarking. It provides many
> features to ensure validity of such measurements.

http://code.google.com/p/ioapps/

Sounds like something we should consider doing for HDFS performance testing...

regards,
Colin
