Bug ID: 37101
           Summary: [x86] Failure to use both div and mod results of one
                    IDIV in a prime-factor loop while(n%i==0) { n/=i; }
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance
          Severity: enhancement
          Priority: P
         Component: new bugs

Near-identical code-gen to
for this loop.  From,
simplified to use a pointer instead of returning std::vector<int>:

void find_prime_factors_ptr(int n, int *p)
    // inefficient to test even numbers > 2, but that's a separate missed
    for (int i = 2; i <= n; i++) {
        while (n % i == 0) {
            *p++ = i;
            n /= i;   // reordering the loop body doesn't help
clang version 7.0.0 (trunk 329780) 

find_prime_factors_ptr(int, int*):
        movl    %edi, %ecx
        cmpl    $2, %edi
        jl      .LBB0_6
        movl    $2, %edi             # why not keep n in EDI, and use ECX for
.LBB0_2:                                # =>This Loop Header: Depth=1
        movl    %ecx, %eax
        idivl   %edi
        testl   %edx, %edx
        jne     .LBB0_5
.LBB0_3:                                #   Parent Loop BB0_2 Depth=1
        movl    %edi, (%rsi)
        movl    %ecx, %eax
        idivl   %edi
        movl    %eax, %ecx
        idivl   %edi
        addq    $4, %rsi
        testl   %edx, %edx
        je      .LBB0_3
.LBB0_5:                                #   in Loop: Header=BB0_2 Depth=1
        leal    1(%rdi), %eax    # this compare-latency optimization 
        cmpl    %ecx, %edi       # doesn't seem worth it
        movl    %eax, %edi       # 4 uops instead of 2 from defeating
        jl      .LBB0_2          # vs. inc %rdi / cmp %ecx,%edi / jle

Using idiv twice inside the loop body is totally pointless.  At either way of
reaching .LBB0_3 (fall through or loop), n/i is in EAX, but LLVM ignores it.

The inner loop can be optimized by dropping movl %ecx, %eax / cltd / idiv with
no other changes.

        movl    %edi, (%rsi)
        addq    $4, %rsi
        movl    %eax, %ecx    # n = tmp
        idivl   %edi          # eax = tmp = n/i
        testl   %edx, %edx
        je      .LBB0_3

Unlike for gcc, changing the inner loop like this didn't help:

        int tmp;
        while (tmp = n/i, n % i == 0) {
            *p++ = i;
            n = tmp;

(see the linked gcc bug report for an interesting version that jumps into the
inner loop instead of hoisting the run-zero-times check out of the loop before
the first iteration.)

I think one of the versions on the godbolt link did manage to optimize better
with clang, but I haven't investigated further.  (I think it was one of the
versions using `vector.push_back` inside the loop.)

You are receiving this mail because:
You are on the CC list for the bug.
llvm-bugs mailing list

Reply via email to