All,

If you want to speed up these routines on processors without
__builtin_clz, there are several ways to implement clz efficiently
in portable C. See nlz (number of leading zeros) from Hacker's Delight:
http://www.hackersdelight.org/HDcode/nlz.c.txt
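For anyone who wants a starting point, here is a minimal sketch of the binary-search style of leading-zero count. The name fallback_clz32 is my own placeholder (it is not in Open MPI and is not copied from the linked file), and it assumes a 32-bit unsigned input:

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: counts the leading zero
 * bits of a 32-bit value with a five-step binary search. */
static inline int fallback_clz32(uint32_t x)
{
    int n = 0;

    if (x == 0) {
        return 32;              /* all 32 bits are zero */
    }
    /* At each step, if the top half of the remaining window is empty,
     * count those zeros and shift the interesting bits up. */
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}
```

Note that this returns 32 for an input of 0, whereas the result of __builtin_clz(0) is undefined, so a fallback like this is not a drop-in match for that one case.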
Or see my Ph.D. advisor's magic algorithms page:
http://aggregate.org/MAGIC/#Leading%20Zero%20Count

You can also implement opal_next_poweroftwo() directly with this:
http://aggregate.org/MAGIC/#Next%20Largest%20Power%20of%202

The Hacker's Delight webpage (and book) are fun to read for that
certain kind of person. :-)
http://www.hackersdelight.org/

On Tue, Oct 11, 2011 at 6:49 PM, <rusra...@osl.iu.edu> wrote:
> Author: rusraink
> Date: 2011-10-11 18:49:01 EDT (Tue, 11 Oct 2011)
> New Revision: 25270
> URL: https://svn.open-mpi.org/trac/ompi/changeset/25270
>
> Log:
> - Check, whether the compiler supports __builtin_clz (count leading
> zeroes);
> if so, use it for bit-operations like opal_cube_dim and opal_hibit.
> Implement two versions of power-of-two.
> In case of opal_next_poweroftwo, this reduces the average execution
> time from 83 cycles to 4 cycles (Intel Nehalem, icc, -O2, inlining,
> measured rdtsc, with loop over 2^27 values).
> Numbers for other functions are similar (but of course heavily depend
> on the usage, e.g. opal_hibit() with a start of 4 does not save
> much). The bsr instruction on AMD Opteron is also not as fast.
>
> - Replace various places where the next power-of-two is computed.
>
> Tested on Intel Nehalem Cluster with openib, compilers GNU-4.6.1 and
> Intel-12.0.4 using mpi_testsuite -t "Collective" with 128 processes.
>
>
> Added:
>    trunk/test/util/opal_bit_ops.c
> Text files modified:
>    trunk/ompi/mca/btl/openib/btl_openib_mca.c            |  13 +---
>    trunk/ompi/mca/btl/sm/btl_sm.h                        |   5 -
>    trunk/ompi/mca/btl/sm/btl_sm_component.c              |   9 +--
>    trunk/ompi/mca/btl/wv/btl_wv_mca.c                    |  13 +---
>    trunk/ompi/mca/coll/basic/coll_basic_reduce_scatter.c |   5 +
>    trunk/ompi/mca/coll/tuned/coll_tuned_allgather.c      |   3
>    trunk/ompi/mca/coll/tuned/coll_tuned_allreduce.c      |   4 +
>    trunk/ompi/mca/coll/tuned/coll_tuned_barrier.c        |   5 +
>    trunk/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c |   5 +
>    trunk/ompi/mca/coll/tuned/coll_tuned_reduce_scatter.c |   5 +
>    trunk/ompi/mca/coll/tuned/coll_tuned_topo.c           |   3
>    trunk/opal/class/opal_hash_table.c                    |   8 --
>    trunk/opal/config/opal_setup_cc.m4                    |  20 ++++++
>    trunk/opal/util/bit_ops.h                             | 106 +++++++++++++++++++++++++++++++++++---
>    trunk/test/util/Makefile.am                           |  14 ++++-
>    15 files changed, 158 insertions(+), 60 deletions(-)
>
> [snip]

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 timat...@open-mpi.org || tmat...@gmail.com
    I'm a bright... http://www.the-brights.net/
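
P.S. As a rough illustration of the next-power-of-two idea mentioned above, here is a minimal sketch of the usual "smear the top bit rightward, then add one" approach, which needs no clz at all. All three function names are hypothetical placeholders of mine, the code assumes 32-bit unsigned values, and I have not checked which variant (strictly greater, or greater-or-equal) matches what opal_next_poweroftwo() in bit_ops.h is expected to return:

```c
#include <stdint.h>

/* Hypothetical helper: copy the highest set bit into every lower bit
 * position, e.g. 0x00010000 becomes 0x0001FFFF. */
static inline uint32_t smear_right32(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x;
}

/* Smallest power of two strictly greater than x.
 * Wraps to 0 when x >= 0x80000000 (unsigned overflow). */
static inline uint32_t next_pow2_exclusive32(uint32_t x)
{
    return smear_right32(x) + 1u;
}

/* Smallest power of two greater than or equal to x.
 * Returns 0 for x == 0 and for x > 0x80000000 (both wrap around). */
static inline uint32_t next_pow2_inclusive32(uint32_t x)
{
    return smear_right32(x - 1u) + 1u;
}
```

For example, next_pow2_exclusive32(8) gives 16 while next_pow2_inclusive32(8) gives 8, so the choice between the two matters wherever the input may already be a power of two.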