On Sun, 10 Sep 2006, Kevin Brown wrote:

> Tom Lane wrote:
> > (does anyone know the cost of ntohl() on modern
> > Intel CPUs?)
> I have a system with an Athlon 64 3200+ (2.0 GHz) running in 64-bit
> mode, another one with the same processor running in 32-bit mode, a
> third running a Pentium 4 1.5 GHz processor, and a fourth running a
> pair of 2.8 GHz Xeons in hyperthreading mode.
> I compiled the test program on the 32-bit systems with the -std=c9x
> option so that the constant would be treated as unsigned.  Other than
> that, the compilation method I used was identical: no optimization,
> since it would skip the loop entirely in the version without the
> ntohl() call.  I compiled it both with and without defining
> CALL_NTOHL, and measured the difference in billed CPU seconds.
> Based on the above, on both Athlon 64 systems, each ntohl() invocation
> and assignment takes 1.04 nanoseconds to complete (I presume the
> assignment is to a register, but I'd have to examine the assembly to
> know for sure).  On the 1.5 GHz P4 system, each iteration takes 8.49
> nanoseconds.  And on the 2.8 GHz Xeon system, each iteration takes
> 5.01 nanoseconds.

Of course, that depends on the particular OS and variant as well.  IIRC,
at some point an instruction was added to the x86 instruction set to do byte
swapping.

This is from /usr/include/netinet/in.h on a Gentoo Linux box with glibc:

#ifdef __OPTIMIZE__
/* We can optimize calls to the conversion functions.  Either nothing has
   to be done or we are using directly the byte-swapping functions which
   often can be inlined.  */
# if __BYTE_ORDER == __BIG_ENDIAN
/* The host byte order is the same as network byte order,
   so these functions are all just identity.  */
# define ntohl(x)       (x)
# define ntohs(x)       (x)
# define htonl(x)       (x)
# define htons(x)       (x)
# else
#  if __BYTE_ORDER == __LITTLE_ENDIAN
#   define ntohl(x)     __bswap_32 (x)
#   define ntohs(x)     __bswap_16 (x)
#   define htonl(x)     __bswap_32 (x)
#   define htons(x)     __bswap_16 (x)
#  endif
# endif
#endif
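
To make the identity-versus-swap distinction concrete, here is a minimal
sketch of what the macros amount to from the caller's side.  The function
names host_is_big_endian() and wire_to_host() are made up for illustration
and are not part of glibc; only ntohl() itself is the real API.

```c
#include <arpa/inet.h>  /* ntohl */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the host stores the most
   significant byte first, i.e. host order equals network order. */
int host_is_big_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);   /* look at the lowest-addressed byte */
    return first == 0;
}

/* Hypothetical helper: interpret 4 bytes as they arrived from the
   network (big-endian) as a host-order integer. */
uint32_t wire_to_host(const unsigned char bytes[4])
{
    uint32_t raw;

    memcpy(&raw, bytes, sizeof raw);  /* avoids alignment problems */
    return ntohl(raw);                /* identity on big-endian hosts */
}
```

On a big-endian host ntohl(1) is still 1; on x86 it comes back as
0x01000000, which is the case where the byte swap is actually paid for.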

And from bits/byteswap.h

/* To swap the bytes in a word the i486 processors and up provide the
   `bswap' opcode.  On i386 we have to use three instructions.  */
#  if !defined __i486__ && !defined __pentium__ && !defined __pentiumpro__ \
      && !defined __pentium4__
#   define __bswap_32(x)                                                      \
     (__extension__                                                           \
      ({ register unsigned int __v, __x = (x);                                \
         if (__builtin_constant_p (__x))                                      \
           __v = __bswap_constant_32 (__x);                                   \
         else                                                                 \
           __asm__ ("rorw $8, %w0;"                                           \
                    "rorl $16, %0;"                                           \
                    "rorw $8, %w0"                                            \
                    : "=r" (__v)                                              \
                    : "0" (__x)                                               \
                    : "cc");                                                  \
         __v; }))
#  else
#   define __bswap_32(x) \
     (__extension__                                                           \
      ({ register unsigned int __v, __x = (x);                                \
         if (__builtin_constant_p (__x))                                      \
           __v = __bswap_constant_32 (__x);                                   \
         else                                                                 \
           __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));                     \
         __v; }))
#  endif
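
For anyone who does not read inline assembly, the two variants can be
modelled in plain C.  This is only a sketch: swap32_shift() and
swap32_rotate() are names I made up, the first follows the usual
shift-and-mask approach (what I assume __bswap_constant_32 does), and the
second mimics the three-rotate i386 sequence step by step.

```c
#include <stdint.h>

/* Rotate only the low 16 bits by 8, like "rorw $8, %w0". */
static uint32_t rotate_low_word_8(uint32_t v)
{
    uint32_t low = v & 0xFFFFu;

    low = ((low >> 8) | (low << 8)) & 0xFFFFu;
    return (v & 0xFFFF0000u) | low;
}

/* Rotate the whole 32-bit value by 16, like "rorl $16, %0". */
static uint32_t rotate_16(uint32_t v)
{
    return (v >> 16) | (v << 16);
}

/* Model of the i386 fallback: rorw, rorl, rorw. */
uint32_t swap32_rotate(uint32_t v)
{
    return rotate_low_word_8(rotate_16(rotate_low_word_8(v)));
}

/* Shift-and-mask byte swap, the form a compiler can fold for constants. */
uint32_t swap32_shift(uint32_t v)
{
    return ((v & 0xFF000000u) >> 24) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x000000FFu) << 24);
}
```

Both functions turn 0x11223344 into 0x44332211, which is all that a single
bswap does on i486 and later.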

/me searches around his hard drive for the ia32 developers reference

Opcode          Instruction     Description
0F C8+rd        BSWAP r32       Reverse the byte order of a 32-bit register


The BSWAP instruction is not supported on IA-32 processors earlier than
the Intel486 processor family. ...

I have read some odd stuff about instructions like these.  Apparently the
fact that this is a "prefixed instruction" (the 0F byte at the beginning)
costs an extra clock cycle, so although this instruction should take 1
cycle, it ends up taking 2.  I am unclear whether or not this is rectified
in later Pentium chips.

So to answer the question about how much ntohl costs on recent Intel
boxes, a properly optimized build with a friendly libc like I quoted
should be able to do it in 2 cycles.
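
For scale, 2 cycles at 2.0 GHz is 1.0 ns, which is in the same ballpark as
the 1.04 ns Kevin measured on the Athlon 64.  If someone wants to check
their own box, a rough sketch of that kind of test follows.  This is not
Kevin's program (I have not seen it); run_loop() and ns_per_iteration() are
names I invented, and the volatile sink is my assumption for keeping the
loop alive when optimization is turned on.  Build it once with
-DCALL_NTOHL and once without, and subtract.

```c
#include <arpa/inet.h>  /* ntohl */
#include <stdint.h>
#include <time.h>       /* clock, CLOCKS_PER_SEC */

/* Runs the loop and returns the last value stored.  The volatile
   sink keeps the compiler from deleting the loop body. */
uint32_t run_loop(uint32_t iterations)
{
    volatile uint32_t sink = 0;
    uint32_t i;

    for (i = 0; i < iterations; i++) {
#ifdef CALL_NTOHL
        sink = ntohl(i);   /* the conversion being measured */
#else
        sink = i;          /* baseline: assignment only */
#endif
    }
    return sink;
}

/* CPU time per iteration in nanoseconds; iterations must be nonzero. */
double ns_per_iteration(uint32_t iterations)
{
    clock_t start = clock();
    clock_t end;

    run_loop(iterations);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / iterations;
}
```

The difference between the two builds, taken over something like a billion
iterations, approximates the per-call cost of ntohl().  Multiply by the
clock rate in GHz to get cycles.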

In Ohio, if you ignore an orator on Decoration day to such an extent as
to publicly play croquet or pitch horseshoes within one mile of the
speaker's stand, you can be fined $25.00.
