> However, your solution still results in a procedure call (mlx4_bf_copy
> is compiled as a procedure using gcc 4.1.0 on an X86_64 host, even if
> I add "inline").

Can you give more detail on the platform and how you compiled? I
can't reproduce it with gcc 4.1.3 here. Are you compiling with
optimization enabled? Are other things like set_atomic_seg() getting
inlined properly?

> I would prefer the patch below (which does generate inline code, and does the
> (sizeof(unsigned long) * 2) calculation just once).

Dividing by 2 * sizeof (long) seems to generate slightly worse code
for me. Since sizeof (long) is a compile-time constant, in my
version the compiler just generates a sub $10, while in your version
there is a sub $1 instead (which costs the same) plus an extra right
shift at the beginning of the loop.

 - R.
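
For readers without the patches at hand, the two loop shapes being compared can be sketched roughly as below. This is only an illustration: the names copy_by_bytes and copy_by_iters are made up for this note and are not the functions in either patch, and both variants assume the byte count is a multiple of 2 * sizeof (long).

```c
#include <stddef.h>

/*
 * Illustrative sketch only: hypothetical names, not the code from either
 * patch. Both variants assume bytecnt is a multiple of 2 * sizeof (long);
 * otherwise variant A would wrap around instead of reaching zero.
 */

/* Variant A: count down in bytes. The step is a compile-time constant,
 * so the loop bookkeeping is a single subtract of that constant. */
static inline void copy_by_bytes(unsigned long *dst, const unsigned long *src,
				 size_t bytecnt)
{
	while (bytecnt > 0) {
		*dst++ = *src++;
		*dst++ = *src++;
		bytecnt -= 2 * sizeof (long);
	}
}

/* Variant B: convert the byte count to an iteration count up front
 * (a division by a power of two, i.e. a right shift), then count
 * down by one per iteration. */
static inline void copy_by_iters(unsigned long *dst, const unsigned long *src,
				 size_t bytecnt)
{
	size_t n = bytecnt / (2 * sizeof (long));

	while (n > 0) {
		*dst++ = *src++;
		*dst++ = *src++;
		--n;
	}
}
```

Both variants copy the same words. The difference under discussion is only the loop bookkeeping, and what the compiler actually emits will depend on the compiler version and optimization flags.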
