On Wed, Jul 20, 2011 at 3:13 PM, Keren Ouaknine <ker...@gmail.com> wrote:
> Hello,
>
> Thank you for the script! I ran it and got the total execution time for
> cat-ing:
>
> major  minor  fs_in  fs_out  wall    user  sys   ctx_invol  ctx_vol
> 2      17049  0      0       1.23    1.60  0.11  18         1113
> 2      17170  0      0       1.22    1.61  0.10  22         1023
> 2      17326  0      0       1.22    1.61  0.10  33         1049
> 2      17222  0      0       1.22    1.61  0.11  23         1020
> 2      18047  0      0       1.22    1.62  0.09  18         1033
> 2      18259  0      0       *1.27*  1.61  0.11  23         1068
> 2      17555  0      0       1.22    1.62  0.09  35         1018
> 2      17633  0      0       1.22    1.61  0.10  21         1036
> 2      17459  0      0       1.22    1.61  0.10  32         1059
> 2      18040  0      0       1.22    1.61  0.10  32         1043
>
> Using reps_per_run(50) and num_trials(10), the script cats the file 50
> times. Why not just 2 or more (from the second iteration the file is in
> the buffer cache)?
>

I usually want to gather a lot of datapoints in order to have good
confidence on a subsequent p-test. The timers from /usr/bin/time aren't that
fine-grained, so having more repetitions is useful.

> Also, I looked at the results and found an outlier (1.27). I would assume
> the execution time is longer due to the load on the machine at the time?
>

Probably, or just a timer granularity issue. Also, note that this time
includes JVM startup time. So, it makes more sense to use this to cat large
files - from your results, it looks like you're catting a fairly small one.
I usually use at least 128MB or 256MB, so with REPS_PER_RUN=50, it's many
GB.

> I would like to get further information such as the cpu time and network
> bandwidth consumed per node for a command. Do you know if Cloudera adds
> hook points to CDH3 to measure these? Are there any other benchmarking
> scripts?
>

For multi-node benchmarks, we usually use the same tools as the rest of the
community - i.e. terasort, gridmix, etc. For micro-benchmarking specific
patches, I usually devise a one-off benchmark to exercise the code path in
question. I've occasionally found it useful to do multinode tests while
running datanodes under a profiler, or with -Xprof, as well.
But, to directly answer your question, CDH3 doesn't have any special hooks
beyond what Apache Hadoop has.

-Todd

>
> On Mon, Jul 18, 2011 at 7:55 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> For benchmarking CPU, I start a pseudo-distributed HDFS cluster, put a
>> smallish file on the local datanode (such that it fits in buffer cache),
>> and then use the following script with various parameters to look at CPU
>> usage to cat the file. For example:
>>
>> $ REPS_PER_RUN=50 NUM_TRIALS=10 ./read-benchmark.sh \
>>     hdfs://localhost/128M-file /tmp/benchmark-results.txt
>>
>> Script:
>>
>> #!/bin/sh -x
>> set -e
>> BINDIR=$(dirname "$0")
>>
>> INPUT=$1
>> OUTPUT=$2
>> NUM_TRIALS=${NUM_TRIALS:-10}
>> HADOOP=${HADOOP:-./bin/hadoop}
>> HADOOP_FLAGS=${HADOOP_FLAGS:--Dio.file.buffer.size=$((64*1024))}
>> REPS_PER_RUN=${REPS_PER_RUN:-1}
>>
>> HEADER="major\tminor\tfs_in\tfs_out\twall\tuser\tsys\tctx_invol\tctx_vol\n"
>> TIME_FORMAT="%F\t%R\t%I\t%O\t%e\t%U\t%S\t%c\t%w"
>>
>> ! test -f "$OUTPUT" && printf "$HEADER" > "$OUTPUT"
>> for x in $(seq 1 "$NUM_TRIALS") ; do
>>   /usr/bin/time --append -o "$OUTPUT" -f "$TIME_FORMAT" \
>>     $HADOOP fs $HADOOP_FLAGS -cat \
>>     $(for rep in $(seq 1 "$REPS_PER_RUN") ; do echo "$INPUT" ; done) > /dev/null
>> done
>>
>>
>> On Wed, Jul 6, 2011 at 1:16 AM, Keren Ouaknine <ker...@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > I am working on the optimization of task scheduling for Hadoop and would
>> > like to benchmark with *Apache Hadoop's standard benchmarks*. So far, I
>> > have used my own scripts to measure and monitor. Where can I find the
>> > benchmarking you are referring to, please?
>> >
>> > Thanks,
>> > Keren
>> >
>> > On Wed, Jul 6, 2011 at 7:32 AM, Todd Lipcon (JIRA) <j...@apache.org>
>> > wrote:
>> >
>> > > Simplify BlockReader to not inherit from FSInputChecker
>> > > -------------------------------------------------------
>> > >
>> > >                 Key: HDFS-2129
>> > >                 URL: https://issues.apache.org/jira/browse/HDFS-2129
>> > >             Project: Hadoop HDFS
>> > >          Issue Type: Sub-task
>> > >          Components: hdfs client
>> > >            Reporter: Todd Lipcon
>> > >            Assignee: Todd Lipcon
>> > >
>> > >
>> > > BlockReader is currently quite complicated since it has to conform to
>> > > the FSInputChecker inheritance structure. It would be much simpler to
>> > > implement it standalone. Benchmarking indicates it's slightly faster,
>> > > as well.
>> > >
>> > > --
>> > > This message is automatically generated by JIRA.
>> > > For more information on JIRA, see:
>> > > http://www.atlassian.com/software/jira
>> > >
>> >
>> > --
>> > Keren Ouaknine
>> > Cell: +972 54 2565404
>> > Web: www.kereno.com
>> >
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
> --
> Keren Ouaknine
> Cell: +972 54 2565404
> Web: www.kereno.com
>

--
Todd Lipcon
Software Engineer, Cloudera
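
Before running any significance test on two sets of trials, it can help to reduce the rows that read-benchmark.sh appends to the results file to a mean and a sample standard deviation per timing column. The following is only a sketch and not part of the script posted in this thread: `summarize_results` is a hypothetical helper name, and it assumes the nine tab-separated columns produced by TIME_FORMAT (wall, user and sys in columns 5 to 7) and a POSIX awk.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original script):
# summarize the tab-separated results file written by read-benchmark.sh.
# Prints n, mean and sample standard deviation for wall, user and sys.
summarize_results() {
    awk -F '\t' '
        # Skip the header and any other non-data line, e.g. a status
        # message that /usr/bin/time may append for a failed command.
        NF != 9 || $1 !~ /^[0-9]+$/ { next }
        {
            n++
            for (c = 5; c <= 7; c++) {      # columns 5-7: wall, user, sys
                sum[c] += $c
                sumsq[c] += $c * $c
            }
        }
        END {
            if (n == 0) { print "no data rows found"; exit 1 }
            split("wall user sys", name, " ")
            for (c = 5; c <= 7; c++) {
                mean = sum[c] / n
                # Sample variance (n - 1 denominator); 0 for a single row.
                var = (n > 1) ? (sumsq[c] - n * mean * mean) / (n - 1) : 0
                if (var < 0) var = 0        # guard against rounding error
                printf "%s\tn=%d\tmean=%.3f\tstddev=%.3f\n", name[c - 4], n, mean, sqrt(var)
            }
        }
    ' "$1"
}
```

Usage, assuming the output path from the example invocation: `summarize_results /tmp/benchmark-results.txt`. A single slow trial such as the 1.27 above shows up as a larger stddev on the wall line, which is one way to judge whether more trials are needed.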