The mechanism as-is for calculating TLB misses does not appear to work. At least on PPC970 which I have courtesy of Terrasoft, the DTLB misses and page table walk cycles cannot be calculated at the same time and it is not the method used in the paper I wrote on the calculation of TLB miss costs.
This patch implements TLB miss cost calculations with oprofile as described on the paper. It uses two separate runs to calculate TLB misses and the time spent walking the page tables when running the STREAM benchmark. As the rate the events occur at deviate so significantly, the oprofile reports would ordinarily not be compatible. To get around this, the second oprofile run is executed multiple times until the number of cycles spent running STREAM are roughly comparable. It then calculates the cost of a TLB miss based on the number of cycles spent walking the page tables and the number of misses that occured. Example output looks like $ tlbmiss_cost.sh --verbose TRACE: Forcing shutdown of oprofile TRACE: Gathering the name of CPU events TRACE: Warming the benchmark to avoid page faults of the binary TRACE: oprofile launch commands as follows TRACE: dtlb misses :: oprofile_start --event timer --event dtlb_miss --sample-cycle-factor 5 TRACE: tablewalk cycles :: oprofile_start --event timer30 --event tablewalk_cycles --sample-cycle-factor 5 --sample-event-factor TRACE: Rerunning benchmark to measure number of DTLB misses TRACE: Shutting down oprofile TRACE: Running benchmark to measure table walk cycles TRACE: High diff with scaling x1. Required 595239 +/ 23809, got 158913 TRACE: High diff with scaling x4. Required 595239 +/ 23809, got 369069 TRACE: High diff with scaling x7. Required 595239 +/ 23809, got 490783 TRACE: High diff with scaling x9. Required 595239 +/ 23809, got 553222 TRACE: Cycle DTLB Scale: 50000 TRACE: Cycle Walk Scale: 50000 TRACE: DTLB Scale: 1000 TRACE: Walk Scale: 10000 TRACE: Cycle DTLB events: 594671 + 568 = 595239 TRACE: Cycle Walk events: 577426 + 1308 = 578734 TRACE: DTLB events: 9789 + 104 = 9893 TRACE: Walk events: 516100 + 1359 = 517459 TLB_MISS_COST=523 So, on PPC970 at least, the cost of a TLB miss is approximately 523 cycles. The method still needs to be verified on different families of POWER. Signed-off-by: Mel Gorman <m...@csn.ul.ie> --- tlbmiss_cost.sh | 164 ++++++++++++++++++++++++++++++++++++++++++++----------- 1 files changed, 132 insertions(+), 32 deletions(-) diff --git a/tlbmiss_cost.sh b/tlbmiss_cost.sh index 792f62d..f376871 100755 --- a/tlbmiss_cost.sh +++ b/tlbmiss_cost.sh @@ -8,6 +8,7 @@ # # Licensed under LGPL 2.1 as packaged with libhugetlbfs # (c) Eric B Munson 2009 +# (c) Mel Gorman 2009 # calibrator can be found here: # http://homepages.cwi.nl/~manegold/Calibrator/v0.9e/calibrator.c @@ -186,7 +187,7 @@ calibrator_calc() # This method uses the stream memory benchmark which can be found here: # http://www.cs.virginia.edu/stream/FTP/Code/stream.c # and should be compiled with this command line: -# gcc -O3 -DN=44739240 stream.c -o STREAM +# gcc -m32 -O3 -DN=44739240 stream.c -o STREAM # and then placed in the same directory as this script stream_fetch() @@ -255,53 +256,152 @@ oprofile_calc() die "Unable to locate oprofile_start.sh." fi - print_trace Beginning TLB measurement using STREAM and oprofile + print_trace Forcing shutdown of oprofile + opcontrol --shutdown >/dev/null 2>&1 + opcontrol --deinit >/dev/null 2>&1 + print_trace Gathering the name of CPU events + TIMER_DTLB_EVENT=`oprofile_map_events.pl --event timer | cut -d: -f1 2> /dev/null` + TIMER_WALK_EVENT=`oprofile_map_events.pl --event timer30 | cut -d: -f1 2> /dev/null` + WALK_EVENT=`oprofile_map_events.pl --event tablewalk_cycles | cut -d: -f1 2> /dev/null` + DTLB_EVENT=`oprofile_map_events.pl --event dtlb_miss | cut -d: -f1 2> /dev/null` - # First we run without hugepages to get a baseline - print_trace Starting oprofile_start --event dtlb_miss --event tablewalk_cycles - $OPST --event dtlb_miss --event tablewalk_cycles >/dev/null 2>&1 - if [ "$?" -ne "0" ]; then - die "Error starting oprofile, check oprofile_map_event.pl for appropriate dtlb_miss and tablewalk_cycles events." - fi + print_trace Warming the benchmark to avoid page faults of the binary + $STREAM >/dev/null 2>&1 - print_trace Gathering baseline of STREAM performance + # Help debug problems launching oprofile + print_trace oprofile launch commands as follows + print_trace dtlb misses :: oprofile_start --event timer --event dtlb_miss --sample-cycle-factor 5 + print_trace tablewalk cycles :: oprofile_start --event timer30 --event tablewalk_cycles --sample-cycle-factor 5 --sample-event-factor $SAMPLE_EVENT_FACTOR + + print_trace Rerunning benchmark to measure number of DTLB misses + $OPST --event timer --event dtlb_miss --sample-cycle-factor 5 >/dev/null 2>&1 || \ + die "Error starting oprofile, check oprofile_map_event.pl for appropriate timer and dtlb_miss events." $STREAM >/dev/null 2>&1 - print_trace Stopping profile opcontrol --stop >/dev/null 2>&1 opcontrol --dump >/dev/null 2>&1 - print_trace Dumping profile - RESULTS=`opreport | grep "_MISS_" | head -1` - MISS_SCALE=`echo $RESULTS | sed 's/.* count \([0-9]*\).*/\1/'` - RESULTS=`opreport | grep "_TABLEWALK_" | head -1` - WALK_SCALE=`echo $RESULTS | sed 's/.* count \([0-9]*\).*/\1/'` - RESULTS=`opreport | grep " STREAM" | head -1` - - # This grabs the first whole number on the line which will be the - # number of table walk cycles, the second will be TLB misses - print_trace Analysing profile - TABLE_WALK=`echo "$RESULTS" | sed 's/[[:space:]]*\([0-9]*\).*/\1/'` - TLB_MISS=`echo "$RESULTS" | sed 's/[[:space:]]*[0-9]*[[:space:]]*[[:graph:]]*[[:space:]]*\([0-9]*\).*/\1/'` - RESULTS=`opreport | grep " vmlinux" | head -1` - KERN_TABLE_WALK=`echo "$RESULTS" | sed 's/[[:space:]]*\([0-9]*\).*/\1/'` - KERN_TLB_MISS=`echo "$RESULTS" | sed 's/[[:space:]]*[0-9]*[[:space:]]*[[:graph:]]*[[:space:]]*\([0-9]*\).*/\1/'` - if [[ "$KERN_TLB_MISS" != "" ]]; then - TLB_MISS=$(($TLB_MISS+$KERN_TLB_MISS)) + # First ensure that the location of event counters are where we + # expect them to be. The expectation is that the timer30 is in + # the first column and the tablewalk_cycles is in the third + SAMPLES_START=`opreport | grep -n "^\s*samples" | head -1 | cut -d: -f1` + if [ "$SAMPLES_START" = "" ]; then + die Could not establish start of samples from opreport + SAMPLES_START=$(($COUNT_START+1)) + fi + INDEX=`opreport | head -$SAMPLES_START | grep "^Counted .* events" | grep -n $TIMER_DTLB_EVENT | cut -d: -f1` + TIMER_DTLB_FIELD=$((1+2*($INDEX - 1))) + if [ $TIMER_DTLB_FIELD -ne 1 ]; then + die Timer events are not in the expected column, parse failure fi - if [[ "$KERN_TABLE_WALK" != "" ]]; then - TABLE_WALK=$(($TABLE_WALK+$KERN_TABLE_WALK)) + INDEX=`opreport | head -$SAMPLES_START | grep "^Counted .* events" | grep -n $DTLB_EVENT | cut -d: -f1` + DTLB_FIELD=$((1+2*($INDEX - 1))) + if [ $DTLB_FIELD -ne 3 ]; then + die DTLB miss events are not in the expected column, parse failure fi + # Columns look ok, extract the event counts + TIMER_DTLB_SCALE=`opreport | grep "$TIMER_DTLB_EVENT" | head -1 | sed 's/.* count \([0-9]*\).*/\1/'` + DTLB_SCALE=`opreport | grep "$DTLB_EVENT" | head -1 | sed 's/.* count \([0-9]*\).*/\1/'` + + RESULTS=`opreport | grep " STREAM" | head -1` + STREAM_TIMER_DTLB=`echo "$RESULTS" | sed 's/[[:space:]]*\([0-9]*\).*/\1/'` + STREAM_DTLB=`echo "$RESULTS" | sed 's/[[:space:]]*[0-9]*[[:space:]]*[[:graph:]]*[[:space:]]*\([0-9]*\).*/\1/'` + RESULTS=`opreport | grep " vmlinux" | head -1` + KERNEL_TIMER_DTLB=`echo "$RESULTS" | sed 's/[[:space:]]*\([0-9]*\).*/\1/'` + KERNEL_DTLB=`echo "$RESULTS" | sed 's/[[:space:]]*[0-9]*[[:space:]]*[[:graph:]]*[[:space:]]*\([0-9]*\).*/\1/'` + + TIMER_DTLB=$(($STREAM_TIMER_DTLB+$KERNEL_TIMER_DTLB)) + DTLB=$(($STREAM_DTLB+$KERNEL_DTLB)) + print_trace Shutting down oprofile opcontrol --shutdown >/dev/null 2>&1 opcontrol --deinit >/dev/null 2>&1 - TLB_MISS=$(($TLB_MISS*$MISS_SCALE)) - TABLE_WALK=$(($TABLE_WALK*$WALK_SCALE)) + # Next STREAM needs to be run measuring the tablewalk_cycles. Because + # of differences in teh frequency CPU events occur, there are + # alterations in the timing. To make an accurate comparison, the + # cycle counts of the two profiles need to be very similar. oprofile + # does not give much help here in matching up different reports taking + # different readings so there is nothing really to do but run STREAM + # multiple times, scaling the events at different rates until a + # reasonably close match is found. + + # The cycle counts for two oprofiles must be within 10% of each other + TOLERANCE=$(($TIMER_DTLB*4/100)) + SAMPLE_EVENT_FACTOR=1 + LOW_TIMER_WALK=0 + HIGH_TIMER_WALK=0 + + print_trace Running benchmark to measure table walk cycles + while [ $TIMER_DTLB -ge $LOW_TIMER_WALK -a $TIMER_DTLB -ge $HIGH_TIMER_WALK ]; do + + if [ $LOW_TIMER_WALK -ne 0 ]; then + print_trace High diff with scaling x$LAST_SAMPLE_EVENT_FACTOR. Required $TIMER_DTLB +/ $TOLERANCE, got $TIMER_WALK + fi + + $OPST --event timer30 --event tablewalk_cycles --sample-cycle-factor 5 --sample-event-factor $SAMPLE_EVENT_FACTOR >/dev/null 2>&1 || \ + die "Error starting oprofile, check oprofile_map_event.pl for appropriate timer30 and tablewalk_cycles events." + $STREAM >/dev/null 2>&1 + + opcontrol --stop >/dev/null 2>&1 + opcontrol --dump >/dev/null 2>&1 + + # Extract the event counts + TIMER_WALK_SCALE=`opreport | grep "$TIMER_WALK_EVENT" | head -1 | sed 's/.* count \([0-9]*\).*/\1/'` + WALK_SCALE=`opreport | grep "$WALK_EVENT" | head -1 | sed 's/.* count \([0-9]*\).*/\1/'` + + RESULTS=`opreport | grep " STREAM" | head -1` + STREAM_TIMER_WALK=`echo "$RESULTS" | sed 's/[[:space:]]*\([0-9]*\).*/\1/'` + STREAM_WALK=`echo "$RESULTS" | sed 's/[[:space:]]*[0-9]*[[:space:]]*[[:graph:]]*[[:space:]]*\([0-9]*\).*/\1/'` + RESULTS=`opreport | grep " vmlinux" | head -1` + KERNEL_TIMER_WALK=`echo "$RESULTS" | sed 's/[[:space:]]*\([0-9]*\).*/\1/'` + KERNEL_WALK=`echo "$RESULTS" | sed 's/[[:space:]]*[0-9]*[[:space:]]*[[:graph:]]*[[:space:]]*\([0-9]*\).*/\1/'` + + TIMER_WALK=$(($STREAM_TIMER_WALK+$KERNEL_TIMER_WALK)) + WALK=$(($STREAM_WALK+$KERNEL_WALK)) + + LOW_TIMER_WALK=$(($TIMER_WALK-$TOLERANCE)) + HIGH_TIMER_WALK=$(($TIMER_WALK+$TOLERANCE)) + + # Scale faster if the difference between timers is huge + LAST_SAMPLE_EVENT_FACTOR=$SAMPLE_EVENT_FACTOR + if [ $(($TIMER_DTLB*3/4-$HIGH_TIMER_WALK)) -gt 0 ]; then + SAMPLE_EVENT_FACTOR=$(($SAMPLE_EVENT_FACTOR+3)) + elif [ $(($TIMER_DTLB*9/10-$HIGH_TIMER_WALK)) -gt 0 ]; then + SAMPLE_EVENT_FACTOR=$(($SAMPLE_EVENT_FACTOR+2)) + else + SAMPLE_EVENT_FACTOR=$(($SAMPLE_EVENT_FACTOR+1)) + fi + + opcontrol --shutdown >/dev/null 2>&1 + opcontrol --deinit >/dev/null 2>&1 + done + + print_trace "Cycle DTLB Scale: $TIMER_DTLB_SCALE" + print_trace "Cycle Walk Scale: $TIMER_WALK_SCALE" + print_trace "DTLB Scale: $DTLB_SCALE" + print_trace "Walk Scale: $WALK_SCALE" + print_trace "Cycle DTLB events: $STREAM_TIMER_DTLB + $KERNEL_TIMER_DTLB = $TIMER_DTLB" + print_trace "Cycle Walk events: $STREAM_TIMER_WALK + $KERNEL_TIMER_WALK = $TIMER_WALK" + print_trace "DTLB events: $STREAM_DTLB + $KERNEL_DTLB = $DTLB" + print_trace "Walk events: $STREAM_WALK + $KERNEL_WALK = $WALK" + + # This shouldn't happen. One would expect that the minimum sample + # rate for any of the timers in any groups is the same. If they + # differ, it might be a simple bug in oprofile_map_event that + # needs fixing. In the event this bug is reported, get the CPU + # type and the output of opcontrol --list-events + if [ $TIMER_DTLB_SCALE -ne $TIMER_WALK_SCALE ]; then + die Cycle CPUs were sampled at different rates. + fi - LAST_LATENCY_CYCLES=$(($TABLE_WALK/$TLB_MISS)) + # Final calculation. Ignore kernel overhead in this version as it + # might be due to interference from other processes + TLB_MISSES=$(($DTLB_SCALE*$DTLB)) + TLB_CYCLES=$(($WALK_SCALE*$WALK)) + LAST_LATENCY_CYCLES=$(($TLB_CYCLES/$TLB_MISSES)) } ARGS=`getopt -o c:s:vq --long calibrator:,stream:,verbose,quiet,fetch-calibrator,fetch-stream -n 'tlbmiss_cost.sh' -- "$@"` -- 1.6.3.3 ------------------------------------------------------------------------------ Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july _______________________________________________ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel