[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

In D70157#1793280 , @reames wrote:

> I've gone ahead and landed the patch so that we can iterate in tree.  See 
> commit 14fc20ca62821b5f85582bf76a467d412248c248 
> .
>
> I've also landed a couple of follow up patches to address issues which would 
> have otherwise required iteration on the review.  See commits 
> c148e2e2ef86f53391be459752511684424f331b 
> , 
> 4024d49edc1598a6f8017df541147b38bf1e2818 
> , and 
> 8b725f0459eee468ed7f9935fba3278fcb4997b1 
> .
>
> I still see some room for further cleanup (i.e. the fragment range scheme and 
> tests), but what's in is of reasonable quality.
>
> There's a couple follow up patches which are probably called for, but I think 
> we can work on these in parallel now.
>
> 1. We need to settle on assembler syntax.
> 2. We need a patch for the x86 MI to MC translation to mark regions unsafe to 
> pad.  (Probably best to separate from the above for the moment.)
> 3. We can incrementally add support for prefix padding.


Thanks for landing it! And happy holiday to all!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

In D70157#1793280 , @reames wrote:

> I've gone ahead and landed the patch so that we can iterate in tree.  See 
> commit 14fc20ca62821b5f85582bf76a467d412248c248 
> .
>
> I've also landed a couple of follow up patches to address issues which would 
> have otherwise required iteration on the review.  See commits 
> c148e2e2ef86f53391be459752511684424f331b 
> , 
> 4024d49edc1598a6f8017df541147b38bf1e2818 
> , and 
> 8b725f0459eee468ed7f9935fba3278fcb4997b1 
> .
>
> I still see some room for further cleanup (i.e. the fragment range scheme and 
> tests), but what's in is of reasonable quality.
>
> There's a couple follow up patches which are probably called for, but I think 
> we can work on these in parallel now.
>
> 1. We need to settle on assembler syntax.
> 2. We need a patch for the x86 MI to MC translation to mark regions unsafe to 
> pad.  (Probably best to separate from the above for the moment.)
> 3. We can incrementally add support for prefix padding.


If you planned to clean up, you could have made the cleanups in the original 
land:) Though there would be a problem of proper contribution attribution. 
After Subversion->Git transition, we can actually run `git commit --amend 
--author=ask-author-to-provide-name-and- ' (Though changing the author 
may not be fair to your contribution to this commit... oh it is so difficult.) 
Anyway, it is nice to see this feature before Christmas and people can start 
investigating its impact now.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

I've gone ahead and landed the patch so that we can iterate in tree.  See 
commit 14fc20ca62821b5f85582bf76a467d412248c248 
.

I've also landed a couple of follow up patches to address issues which would 
have otherwise required iteration on the review.  See commits 
c148e2e2ef86f53391be459752511684424f331b 
, 
4024d49edc1598a6f8017df541147b38bf1e2818 
, and 
8b725f0459eee468ed7f9935fba3278fcb4997b1 
.

I still see some room for further cleanup (i.e. the fragment range scheme and 
tests), but what's in is of reasonable quality.

There's a couple follow up patches which are probably called for, but I think 
we can work on these in parallel now.

1. We need to settle on assembler syntax.
2. We need a patch for the x86 MI to MC translation to mark regions unsafe to 
pad.  (Probably best to separate from the above for the moment.)
3. We can incrementally add support for prefix padding.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread Philip Reames via Phabricator via cfe-commits
This revision was not accepted when it landed; it landed in state "Needs 
Review".
This revision was automatically updated to reflect the committed changes.
Closed by commit rG14fc20ca6282: Align branches within 32-Byte boundary (NOP 
padding) (authored by reames).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s
  llvm/test/MC/X86/align-branch-64-5b.s

Index: llvm/test/MC/X86/align-branch-64-5b.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5b.s
@@ -0,0 +1,50 @@
+# Check option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret can cowork with option --mc-relax-all
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret --mc-relax-all %s | llvm-objdump -d  - > %t1
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:8: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   10: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   18: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   1b: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   1d: 90   nop
+# CHECK-NEXT:   1e: 90   nop
+# CHECK-NEXT:   1f: 90   nop
+# CHECK-NEXT:   20: 0f 85 f5 ff ff ffjne {{.*}}
+# CHECK-NEXT:   26: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   2e: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   36: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   39: 0f 85 e7 ff ff ffjne {{.*}}
+# CHECK-NEXT:   3f: 90   nop
+# CHECK-NEXT:   40: e9 d6 ff ff ff   jmp {{.*}}
+# CHECK-NEXT:   45: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   4d: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   55: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   5d: 90   nop
+# CHECK-NEXT:   5e: 90   nop
+# CHECK-NEXT:   5f: 90   nop
+# CHECK-NEXT:   60: e8 9b ff ff ff   callq   {{.*}}
+# CHECK-NEXT:   65: e9 bc ff ff ff   jmp {{.*}}
+.text
+.p2align 4
+foo:
+  .rept 3
+  movl  %eax, %fs:0x1
+  .endr
+  shrl  $2, %ecx
+.L1:
+  movl  %edx, %ecx
+  jne   .L1
+.L2:
+  .rept 2
+  movl  %eax, %fs:0x1
+  .endr
+  testb $2, %dl
+  jne   .L2
+  jmp   .L1
+  .rept 3
+  movl  %eax, %fs:0x1
+  .endr
+  call  foo
+  jmp   .L2
Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,43 @@
+# Check no nop is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-COUNT-3:  : 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK:18: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   1b: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   1d: 75 fcjne {{.*}}
+# CHECK-NEXT:   1f: 55   pushq   %rbp
+# CHECK-NEXT:   20: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   23: 75 fajne {{.*}}
+# CHECK-COUNT-2:  : 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK:35: c1 e9 02 shrl   

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234846.
skan added a comment.

1. Add more tests for `!VK_NONE` cases.
2. Reduce pervasive `auto`


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s
  llvm/test/MC/X86/align-branch-64-5b.s

Index: llvm/test/MC/X86/align-branch-64-5b.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5b.s
@@ -0,0 +1,50 @@
+# Check option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret can cowork with option --mc-relax-all
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret --mc-relax-all %s | llvm-objdump -d  - > %t1
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:8: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   10: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   18: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   1b: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   1d: 90   nop
+# CHECK-NEXT:   1e: 90   nop
+# CHECK-NEXT:   1f: 90   nop
+# CHECK-NEXT:   20: 0f 85 f5 ff ff ffjne {{.*}}
+# CHECK-NEXT:   26: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   2e: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   36: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   39: 0f 85 e7 ff ff ffjne {{.*}}
+# CHECK-NEXT:   3f: 90   nop
+# CHECK-NEXT:   40: e9 d6 ff ff ff   jmp {{.*}}
+# CHECK-NEXT:   45: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   4d: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   55: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   5d: 90   nop
+# CHECK-NEXT:   5e: 90   nop
+# CHECK-NEXT:   5f: 90   nop
+# CHECK-NEXT:   60: e8 9b ff ff ff   callq   {{.*}}
+# CHECK-NEXT:   65: e9 bc ff ff ff   jmp {{.*}}
+.text
+.p2align 4
+foo:
+  .rept 3
+  movl  %eax, %fs:0x1
+  .endr
+  shrl  $2, %ecx
+.L1:
+  movl  %edx, %ecx
+  jne   .L1
+.L2:
+  .rept 2
+  movl  %eax, %fs:0x1
+  .endr
+  testb $2, %dl
+  jne   .L2
+  jmp   .L1
+  .rept 3
+  movl  %eax, %fs:0x1
+  .endr
+  call  foo
+  jmp   .L2
Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,43 @@
+# Check no nop is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-COUNT-3:  : 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK:18: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   1b: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   1d: 75 fcjne {{.*}}
+# CHECK-NEXT:   1f: 55   pushq   %rbp
+# CHECK-NEXT:   20: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   23: 75 fajne {{.*}}
+# CHECK-COUNT-2:  : 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK:35: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   38: e8 c3 ff ff ff   callq   {{.*}}
+# CHECK-NEXT:   3d: ff e0jmpq*%rax
+# CHECK-NEXT:  

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

In D70157#1792262 , @skan wrote:

> Do you think this patch is ready to land? @MaskRay


It is 00:35 here and I should confess that I haven't read through this yet.

There are some minor things like (1) pervasive `auto` (2) `report_fatal_error` 
in `X86AlignBranchKind` is not suitable. I expect the error is reported at a 
command line parsing stage. (3) I don't see how `__tls_get_addr` is referenced 
in code but somehow it magically appears in tests. It may be related to 
`VK_NONE` but there should be more tests for `!VK_NONE` cases. (4) More 
function-level comments are needed.

If it is not super urgent, probably want a couple of hours to get a response 
from @reames or @jyknight whether they think this can be committed as is, with 
iteration work in the future?

(Also note that @fedor.sergeev has requested changes. The convention is to wait 
for @fedor.sergeev to agree that this can be accepted... but there are vacation 
schedules involved.)




Comment at: llvm/lib/MC/MCAssembler.cpp:1009
+  AlignedOffset -= OldSize;
+  auto BoundaryAlignment = BF.getAlignment();
+  uint64_t NewSize = needPadding(AlignedOffset, AlignedSize, BoundaryAlignment)

`Align`


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

Do you think this patch is ready to land? @MaskRay


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-20 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added inline comments.



Comment at: llvm/lib/MC/MCAssembler.cpp:1003
+  // exists) also marks the end of the branch.
+  for (auto i = 0U, N = BF.isFused() ? 2U : 1U;
+   i != N && !isa(F); ++i, F = F->getNextNode()) {

skan wrote:
> MaskRay wrote:
> > unsigned -> int
> Why use `int` here?
The loop variable can only be 0 or 1. 0U 1U 2U the unsigned suffix are just 
redundant. `int` suffices.  It also improves readability a bit by avoiding 
`auto`.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234830.
skan added a comment.

1. Simplify the test cases.
2. Add some comments


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s
  llvm/test/MC/X86/align-branch-64-5b.s

Index: llvm/test/MC/X86/align-branch-64-5b.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5b.s
@@ -0,0 +1,50 @@
+# Check option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret can cowork with option --mc-relax-all
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret --mc-relax-all %s | llvm-objdump -d  - > %t1
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:8: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   10: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   18: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   1b: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   1d: 90   nop
+# CHECK-NEXT:   1e: 90   nop
+# CHECK-NEXT:   1f: 90   nop
+# CHECK-NEXT:   20: 0f 85 f5 ff ff ffjne {{.*}}
+# CHECK-NEXT:   26: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   2e: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   36: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   39: 0f 85 e7 ff ff ffjne {{.*}}
+# CHECK-NEXT:   3f: 90   nop
+# CHECK-NEXT:   40: e9 d6 ff ff ff   jmp {{.*}}
+# CHECK-NEXT:   45: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   4d: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   55: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:   5d: 90   nop
+# CHECK-NEXT:   5e: 90   nop
+# CHECK-NEXT:   5f: 90   nop
+# CHECK-NEXT:   60: e8 9b ff ff ff   callq   {{.*}}
+# CHECK-NEXT:   65: e9 bc ff ff ff   jmp {{.*}}
+.text
+.p2align 4
+foo:
+  .rept 3
+  movl  %eax, %fs:0x1
+  .endr
+  shrl  $2, %ecx
+.L1:
+  movl  %edx, %ecx
+  jne   .L1
+.L2:
+  .rept 2
+  movl  %eax, %fs:0x1
+  .endr
+  testb $2, %dl
+  jne   .L2
+  jmp   .L1
+  .rept 3
+  movl  %eax, %fs:0x1
+  .endr
+  call  foo
+  jmp   .L2
Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,43 @@
+# Check no nop is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-COUNT-3:  : 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK:18: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   1b: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   1d: 75 fcjne {{.*}}
+# CHECK-NEXT:   1f: 55   pushq   %rbp
+# CHECK-NEXT:   20: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   23: 75 fajne {{.*}}
+# CHECK-COUNT-2:  : 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK:35: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   38: e8 c3 ff ff ff   callq   {{.*}}
+# CHECK-NEXT:   3d: ff e0jmpq*%rax
+# CHECK-NEXT:   3f: 55   

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread Kan Shengchen via Phabricator via cfe-commits
skan marked an inline comment as done.
skan added inline comments.



Comment at: llvm/lib/MC/MCAssembler.cpp:1003
+  // exists) also marks the end of the branch.
+  for (auto i = 0U, N = BF.isFused() ? 2U : 1U;
+   i != N && !isa(F); ++i, F = F->getNextNode()) {

MaskRay wrote:
> unsigned -> int
Why use `int` here?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added inline comments.



Comment at: llvm/lib/MC/MCAssembler.cpp:993
+ MCBoundaryAlignFragment ) {
+  // The MCBoundaryAlignFragment that doesn't emit NOP should not relax.
+  if (!BF.canEmitNops())

be relaxed



Comment at: llvm/lib/MC/MCAssembler.cpp:997
+
+  auto AlignedOffset = Layout.getFragmentOffset(BF.getNextNode());
+  uint64_t AlignedSize = 0;

uint64_t



Comment at: llvm/lib/MC/MCAssembler.cpp:999
+  uint64_t AlignedSize = 0;
+  auto *F = BF.getNextNode();
+  // If the branch is unfused, it is emitted into one fragment, otherwise it is

`auto` -> `MCFragment`



Comment at: llvm/lib/MC/MCAssembler.cpp:1003
+  // exists) also marks the end of the branch.
+  for (auto i = 0U, N = BF.isFused() ? 2U : 1U;
+   i != N && !isa(F); ++i, F = F->getNextNode()) {

unsigned -> int



Comment at: llvm/lib/MC/MCAssembler.cpp:1007
+  }
+  auto OldSize = BF.getSize();
+  AlignedOffset -= OldSize;

uint64_t


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1791347 , @reames wrote:

> The general question is why a *range* of fragments can't be defined.  
> Computing the instruction size for the entire range then just requires 
> walking from first to last fragment in the range summing the size of each.  
> If both instructions are within the same data fragment, then no relaxation is 
> needed, and the size of both is simply the size of the data fragment.
>
> Unless I'm missing something here?


Thank you for your detailed explanation, the solution has been consistent with 
your suggestion now.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

> To be concrete, I propose:
>  ".autopad", ".noautopad": allow/disallow the assembler to emit padding via 
> inserting a nop or prefix before any instruction, as needed.
>  ".align_branch_boundary [N,] [instruction,]": Enable branch-boundary padding 
> (per previous description).

I had thought I sent the comments yesterday, but I didn't. :( I think my 
comments are aligned with the conclusion that Philip and Jame got. Wait for 
Jame's summary.

In general, I think it's a good idea to have generic directives to control the 
padding behavior in assembler. ".autopad", ".noautopad", 
".align_branch_boundary" looks good except it can't handle the nested 
scenarios. I can imagine the nested cases in assembly file which includes 
another assembly file. If we want to handle it, we need to have a pair of 
directives for each above. It will make the semantics complicated. We need 
elaborately design it.
I'd echo what Philip said. We want a more focused and basic implementation 
here. We're very open to have more discussion on generic directives. However, 
I'd prefer it's a separate topic from this one.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread Philip Reames via Phabricator via cfe-commits
reames accepted this revision.
reames added a comment.

LGTM.  The patch isn't perfect, but this has reached the point where landing 
and iterating in tree is the fastest way to convergence.  To be clear, this 
LGTM *only* applies to the current patch contents, and the *internal only* flag 
names this introduces.  These may change before we expose this publicly.

Warts with the current patch we should iterate to address:

- Stylistic issues w.r.t. comments, naming, static vs member functions remain.  
None are show stoppers.
- Many of the tests can probably be simplified and condensed.
- The bundling scheme used is probably more complicated than needed (see 
previous suggestions).
- This patch doesn't include anyway to locally disable padding, and thus is 
*known incorrect* in some cases.  As this remains off by default, this is not a 
blocker for commit.

p.s. I spoke w/James this morning, and we came up with some revisions in 
approach he's going to suggest separately.  We did explicitly discuss the 
status of the current patch, and whether it needed further review cycles or 
could be iterated in tree.  Normally, I'd wait for him to summarize that 
conversation publicly, but given the time delay and vacation schedules 
involved, I want to avoid loosing another day cycle.




Comment at: llvm/include/llvm/MC/MCFragment.h:570
+private:
+  /// The size of the MCBoundaryAlignFragment.
+  uint64_t Size = 0;

Please add a note that the size is lazily set during relaxation, and is not 
meaningful before that.  



Comment at: llvm/test/MC/X86/align-branch-32-1a.s:34
+foo:
+  movl  %eax, %fs:0x1
+  pushl  %ebp

For testing alignment functionality, we can probably use a repl int3 pattern 
here.  It would make the tests much more concise, and shouldn't effect the 
logic being (currently) tested.  



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

In D70157#1789159 , @skan wrote:

> In D70157#1788445 , @reames wrote:
>
> > Specifically on the revised patch, I remain confused by the need for 
> > multiple subtypes.  The need for fragments *between* the potentially fused 
> > instructions doesn't make sense to me.  What I was expecting to see was the 
> > following:
> >  BoundaryAlign w/target=the branch fragment
> >  .. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
> >  the branch fragment
> >  a new data fragment if the branch fragment was a DF
> >
> > (i.e. a single BounaryAlign fragment which aligns a payload which is 
> > defined as "next fragment to target fragment inclusive".)
> >
> > To be specific, I'd expect to see the following for an example fused 
> > sequence:
> >
> > 1. BoundaryAlign w/Target = 3
> > 2. DataFragment containing TEST RAX, RAX
> > 3. RelaxeableFragment containing JNE symbo
> >
> >   Why do we need anything between the two fragments of the fused pair?
> >
> >   (As a reminder, I am new to this code.  If I'm missing the obvious, 
> > please just point it out.)
>
>
> JUMP is not always emiteed into `MCRelaxableFragment`, it also can be emitted 
> into `MCDataFragment` and  arithmetic ops with constant arguments of unknown 
> value (e.g. ADD,AND) can be emitted into `MCRelaxableFragment` ,  you can 
> find related code in `MCObjectStreamer::EmitInstructionImpl`, 
> `X86AsmBackend::mayNeedRelaxation`.  Let's say JCC is fused with TEST,  there 
> are four possible positions for JCC and CMP
>
> 1. JCC and CMP are in same `MCDataFragment`
> 2. JCC and CMP are in  two different `MCDataFragment`
> 3. JCC and CMP are in two different `MCRelaxableFragment`
> 4. JCC in a `MCRelaxableFragment`, CMP is in a `MCDataFragment`
>
>   and since `MCCompactEncodedInstFragment` is not applicable yet, i don't 
> what's its behaviour.
>
>   In order to compute the total size of CMP and JCC in 
> `MCAssembler::relaxBoundaryAlign`, I insert a `FusedJccSplit` to force CMP 
> and JCC in two fragments. Do you have any better idea?


I agree there are multiple cases here, see the original generic description 
instead of the specific example.  The general question is why a *range* of 
fragments can't be defined.  Computing the instruction size for the entire 
range then just requires walking from first to last fragment in the range 
summing the size of each.  If both instructions are within the same data 
fragment, then no relaxation is needed, and the size of both is simply the size 
of the data fragment.

Unless I'm missing something here?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-19 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234697.
skan added a comment.

Add more comment


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s

Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,63 @@
+# Check no nop or prefix is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:3: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:6: 89 d1movl%edx, %ecx
+# CHECK-NEXT:8: 31 c0xorl%eax, %eax
+# CHECK-NEXT:a: 31 c8xorl%ecx, %eax
+# CHECK-NEXT:c: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:f: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   12: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   15: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   18: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   1b: f3 abrep stosl%eax, %es:(%rdi)
+# CHECK-NEXT:   1d: 75 e4jne {{.*}}
+# CHECK-NEXT:   1f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   21: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   24: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   27: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   2a: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   2c: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   2e: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   31: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   34: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   37: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   3a: e8 00 00 00 00   callq   {{.*}}
+# CHECK-NEXT:   3f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   41: 75 e1jne {{.*}}
+
+.text
+.p2align 4,,15
+foo:
+shrl$2, %ecx
+.L1:
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+xorl%ecx, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+rep stosl
+jne.L1
+xorl%eax, %eax
+shrl$2, %ecx
+.L2:
+shrl$2, %ecx
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+callbar
+xorl%eax, %eax
+jne.L2
Index: llvm/test/MC/X86/align-branch-64-4a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-4a.s
@@ -0,0 +1,63 @@
+# Check rets are not aligned with option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp %s | llvm-objdump -d  - >%t2
+# RUN: cmp %t %t2
+
+# Check only rets are aligned with option --x86-align-branch-boundary=32 --x86-align-branch=ret
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=ret %s | llvm-objdump -d  - | FileCheck %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:8: 55

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-18 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234650.
skan added a comment.

move the code that checks if we can reuse the current `MCBoundaryAlignFragment` 
 into the function `X86AsmBackend::getOrCreateBoundaryAlignFragment`


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s

Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,63 @@
+# Check no nop or prefix is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:3: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:6: 89 d1movl%edx, %ecx
+# CHECK-NEXT:8: 31 c0xorl%eax, %eax
+# CHECK-NEXT:a: 31 c8xorl%ecx, %eax
+# CHECK-NEXT:c: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:f: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   12: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   15: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   18: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   1b: f3 abrep stosl%eax, %es:(%rdi)
+# CHECK-NEXT:   1d: 75 e4jne {{.*}}
+# CHECK-NEXT:   1f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   21: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   24: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   27: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   2a: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   2c: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   2e: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   31: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   34: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   37: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   3a: e8 00 00 00 00   callq   {{.*}}
+# CHECK-NEXT:   3f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   41: 75 e1jne {{.*}}
+
+.text
+.p2align 4,,15
+foo:
+shrl$2, %ecx
+.L1:
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+xorl%ecx, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+rep stosl
+jne.L1
+xorl%eax, %eax
+shrl$2, %ecx
+.L2:
+shrl$2, %ecx
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+callbar
+xorl%eax, %eax
+jne.L2
Index: llvm/test/MC/X86/align-branch-64-4a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-4a.s
@@ -0,0 +1,63 @@
+# Check rets are not aligned with option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp %s | llvm-objdump -d  - >%t2
+# RUN: cmp %t %t2
+
+# Check only rets are aligned with option --x86-align-branch-boundary=32 --x86-align-branch=ret
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=ret %s | llvm-objdump -d  - | FileCheck %s
+
+# 

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-18 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234520.
skan added a comment.

Fix a typo in `MCFragment::dump()`


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s

Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,63 @@
+# Check no nop or prefix is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:3: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:6: 89 d1movl%edx, %ecx
+# CHECK-NEXT:8: 31 c0xorl%eax, %eax
+# CHECK-NEXT:a: 31 c8xorl%ecx, %eax
+# CHECK-NEXT:c: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:f: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   12: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   15: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   18: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   1b: f3 abrep stosl%eax, %es:(%rdi)
+# CHECK-NEXT:   1d: 75 e4jne {{.*}}
+# CHECK-NEXT:   1f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   21: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   24: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   27: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   2a: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   2c: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   2e: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   31: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   34: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   37: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   3a: e8 00 00 00 00   callq   {{.*}}
+# CHECK-NEXT:   3f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   41: 75 e1jne {{.*}}
+
+.text
+.p2align 4,,15
+foo:
+shrl$2, %ecx
+.L1:
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+xorl%ecx, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+rep stosl
+jne.L1
+xorl%eax, %eax
+shrl$2, %ecx
+.L2:
+shrl$2, %ecx
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+callbar
+xorl%eax, %eax
+jne.L2
Index: llvm/test/MC/X86/align-branch-64-4a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-4a.s
@@ -0,0 +1,63 @@
+# Check rets are not aligned with option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp %s | llvm-objdump -d  - >%t2
+# RUN: cmp %t %t2
+
+# Check only rets are aligned with option --x86-align-branch-boundary=32 --x86-align-branch=ret
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=ret %s | llvm-objdump -d  - | FileCheck %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# 

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-18 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234515.
skan added a comment.

1. rename `MCBoundaryAlignFragment::hasEmitNop()` to 
`MCBoundaryAlignFragment::canEmitNop()`

2. reduce the number of `MCBoundaryAlignFragment` emitted as possible


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s

Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,63 @@
+# Check no nop or prefix is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:3: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:6: 89 d1movl%edx, %ecx
+# CHECK-NEXT:8: 31 c0xorl%eax, %eax
+# CHECK-NEXT:a: 31 c8xorl%ecx, %eax
+# CHECK-NEXT:c: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:f: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   12: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   15: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   18: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   1b: f3 abrep stosl%eax, %es:(%rdi)
+# CHECK-NEXT:   1d: 75 e4jne {{.*}}
+# CHECK-NEXT:   1f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   21: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   24: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   27: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   2a: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   2c: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   2e: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   31: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   34: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   37: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   3a: e8 00 00 00 00   callq   {{.*}}
+# CHECK-NEXT:   3f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   41: 75 e1jne {{.*}}
+
+.text
+.p2align 4,,15
+foo:
+shrl$2, %ecx
+.L1:
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+xorl%ecx, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+rep stosl
+jne.L1
+xorl%eax, %eax
+shrl$2, %ecx
+.L2:
+shrl$2, %ecx
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+callbar
+xorl%eax, %eax
+jne.L2
Index: llvm/test/MC/X86/align-branch-64-4a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-4a.s
@@ -0,0 +1,63 @@
+# Check rets are not aligned with option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp %s | llvm-objdump -d  - >%t2
+# RUN: cmp %t %t2
+
+# Check only rets are aligned with option --x86-align-branch-boundary=32 --x86-align-branch=ret
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=ret %s | llvm-objdump -d  - | 

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-18 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234488.
skan edited the summary of this revision.
skan added a comment.

**Simplify**

Drop the subtype of `MCBoundaryAlignFragment` and add data member `EmitNops` to 
indicate whether NOPs should be emitted.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s

Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,63 @@
+# Check no nop or prefix is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:3: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:6: 89 d1movl%edx, %ecx
+# CHECK-NEXT:8: 31 c0xorl%eax, %eax
+# CHECK-NEXT:a: 31 c8xorl%ecx, %eax
+# CHECK-NEXT:c: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:f: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   12: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   15: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   18: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   1b: f3 abrep stosl%eax, %es:(%rdi)
+# CHECK-NEXT:   1d: 75 e4jne {{.*}}
+# CHECK-NEXT:   1f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   21: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   24: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   27: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   2a: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   2c: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   2e: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   31: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   34: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   37: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   3a: e8 00 00 00 00   callq   {{.*}}
+# CHECK-NEXT:   3f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   41: 75 e1jne {{.*}}
+
+.text
+.p2align 4,,15
+foo:
+shrl$2, %ecx
+.L1:
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+xorl%ecx, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+rep stosl
+jne.L1
+xorl%eax, %eax
+shrl$2, %ecx
+.L2:
+shrl$2, %ecx
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+callbar
+xorl%eax, %eax
+jne.L2
Index: llvm/test/MC/X86/align-branch-64-4a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-4a.s
@@ -0,0 +1,63 @@
+# Check rets are not aligned with option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp %s | llvm-objdump -d  - >%t2
+# RUN: cmp %t %t2
+
+# Check only rets are aligned with option --x86-align-branch-boundary=32 --x86-align-branch=ret
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=ret %s | llvm-objdump -d 

[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-17 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1788445 , @reames wrote:

> Specifically on the revised patch, I remain confused by the need for multiple 
> subtypes.  The need for fragments *between* the potentially fused 
> instructions doesn't make sense to me.  What I was expecting to see was the 
> following:
>  BoundaryAlign w/target=the branch fragment
>  .. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
>  the branch fragment
>  a new data fragment if the branch fragment was a DF
>
> (i.e. a single BounaryAlign fragment which aligns a payload which is defined 
> as "next fragment to target fragment inclusive".)
>
> To be specific, I'd expect to see the following for an example fused sequence:
>
> 1. BoundaryAlign w/Target = 3
> 2. DataFragment containing TEST RAX, RAX
> 3. RelaxeableFragment containing JNE symbo
>
>   Why do we need anything between the two fragments of the fused pair?
>
>   (As a reminder, I am new to this code.  If I'm missing the obvious, please 
> just point it out.)


JUMP is not always emiteed into `MCRelaxableFragment`, it also can be emitted 
into `MCDataFragment` and  arithmetic ops with constant arguments of unknown 
value (e.g. ADD,AND) can be emitted into 'MCRelaxableFragment' ,  you can find 
related code in `MCObjectStreamer::EmitInstructionImpl`, 
'X86AsmBackend::mayNeedRelaxation'.  Let's say JCC is fused with TEST,  there 
are four possible positions for JCC and CMP

1. JCC and CMP are in same `MCDataFragment`
2. JCC and CMP are in  two different `MCDataFragment`
3. JCC and CMP are in two different `MCRelaxableFragment`
4. JCC in a `MCRelaxableFragment`, CMP is in a `MCDataFragment`

and since `MCCompactEncodedInstFragment` is not applicable yet, i don't what's 
its behaviour.

In order to compute the total size of CMP and JCC in 
`MCAssembler::relaxBoundaryAlign`, I insert a `FusedJccSplit` to force CMP and 
JCC in two fragments. Do you have any better idea?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-17 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

I've updated the draft assembler support in https://reviews.llvm.org/D71315 to 
match the proposal here.  Again, this is very much WIP and mostly just to show 
proposed syntax/usage.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-17 Thread James Y Knight via Phabricator via cfe-commits
jyknight added a comment.

In D70157#1788418 , @reames wrote:

> In D70157#1788025 , @jyknight wrote:
>
> > > .push_align_branch_boundary [N,] [instruction,]*
> >
> > I'd like to raise again the possibility of using a more general region 
> > directive to denote "It is allowable to add prefixes/nops before 
> > instructions in this region if the assembler wants to", as I'd started 
> > discussing in https://reviews.llvm.org/D71238#1786885 (but let's move the 
> > discussion here).
>
>
> James, I think this proposal is increasing the scope of this proposal too 
> much.  It also ignores some of the use cases identified and described in the 
> writeup (i.e. the scoped semantics).  I'm open to discussing such a feature 
> more generally, but I'd prefer to see a more narrowly focused feature 
> immediately.


I do not intend that we expand the scope of the project to include any of the 
other features.

All I want is to slightly consider surrounding features when adding the new 
assembly syntax. The situations where we want to avoid modifying a certain 
block of code are extremely likely to apply to //any// 
nop-or-prefix-introducing code modifications -- not just modifications 
resulting from branch alignment. So if we can make the directives annotating 
where such changes are allowable (and conversely, where they are not) 
generally-applicable, with a very minimal amount of work, that would be nice.

I also don't understand what you mean by "it ignores [...] scoped semantics"?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-17 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

Specifically on the revised patch, I remain confused by the need for multiple 
subtypes.  The need for fragments *between* the potentially fused instructions 
doesn't make sense to me.  What I was expecting to see was the following:
BoundaryAlign w/target=the branch fragment
.. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
the branch fragment
a new data fragment if the branch fragment was a DF

(i.e. a single BounaryAlign fragment which aligns a payload which is defined as 
"next fragment to target fragment inclusive".)

To be specific, I'd expect to see the following for an example fused sequence:

1. BoundaryAlign w/Target = 3
2. DataFragment containing TEST RAX, RAX
3. RelaxeableFragment containing JNE symbo

Why do we need anything between the two fragments of the fused pair?

(As a reminder, I am new to this code.  If I'm missing the obvious, please just 
point it out.)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-17 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

In D70157#1788025 , @jyknight wrote:

> > .push_align_branch_boundary [N,] [instruction,]*
>
> I'd like to raise again the possibility of using a more general region 
> directive to denote "It is allowable to add prefixes/nops before instructions 
> in this region if the assembler wants to", as I'd started discussing in 
> https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).


James, I think this proposal is increasing the scope of this proposal too much. 
 It also ignores some of the use cases identified and described in the writeup 
(i.e. the scoped semantics).  I'm open to discussing such a feature more 
generally, but I'd prefer to see a more narrowly focused feature immediately.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-17 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

In D70157#1787403 , @annita.zhang 
wrote:

> In D70157#1787160 , @MaskRay wrote:
>
> >
>
>
>
>
> > There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). 
> > With .push_align_branch/.pop_align_branch, we probably don't need the 
> > 'switch-to-default' action.
> > 
> > I don't know how likely we may ever need nested states (e.g. an `.include` 
> > directive inside an .align_branch region where the included file has own 
> > idea about branch alignment), but .push/.pop does not seem to be more 
> > complex than disable/enable/default.
>
> I rethink about the directives and prefer the .push/.pop pair as @MaskRay 
> suggested. To be specified, I'd suggest to use .push_align_branch_boundary 
> and .pop_align_branch_boundary to align with MC command line options. They 
> will cowork with the command line options and overwrite the options if both 
> are existing.


I agree that we need the push/pop semantics.

> To be clarified, I described the behavior of the directives from my 
> understanding. Feel free to speak if you have difference opinion.
> 
> .push_align_branch_boundary [N,] [instruction,]*
> 
>   This directive specifies the beginning of a region which will overwrite the 
> value set by the command line or by the previous directive. It can represent 
> either an enabling or disabling directive controlled by parameter N. 
>   N indicates to align the branches within N byte boundary. The default value 
> is 32. If N is 0, it means the branch alignment is off within this region. 
>   Instruction specifies types of branches to align. The value is one or 
> multiple values from fused, jcc, jmp, call, ret and indirect. The default 
> value is fused, jcc and jmp. (may change later)

I'd remove the defaults.  Let's just be explicit about what is being 
enabled/disabled.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary(NOP padding)

2019-12-17 Thread Kan Shengchen via Phabricator via cfe-commits
skan updated this revision to Diff 234312.
skan retitled this revision from "Align branches within 32-Byte boundary" to 
"Align branches within 32-Byte boundary(NOP padding)".
skan edited the summary of this revision.
skan added a comment.

**Simplify**

1. Drop prefix padding support
2. Drop clang driver support
3. Drop default align option `-x86-branches-within-32B-boundaries`
4. Drop hardcode support

**Other**

1. Throw an error if an illegal value is passed to option `-x86-align-branch`
2. Use `llvm::Align` instead of `uint64_t` to store the information about 
boundary
3. Remove test cases for prefix padding and add more test cases for NOP padding
4. Rename `MCMachineDependentFragment` to `MCBoundaryAlignFragment`


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157

Files:
  llvm/include/llvm/MC/MCAsmBackend.h
  llvm/include/llvm/MC/MCAssembler.h
  llvm/include/llvm/MC/MCFragment.h
  llvm/include/llvm/MC/MCObjectStreamer.h
  llvm/lib/MC/MCAssembler.cpp
  llvm/lib/MC/MCFragment.cpp
  llvm/lib/MC/MCObjectStreamer.cpp
  llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
  llvm/test/MC/X86/align-branch-32-1a.s
  llvm/test/MC/X86/align-branch-64-1a.s
  llvm/test/MC/X86/align-branch-64-1b.s
  llvm/test/MC/X86/align-branch-64-1c.s
  llvm/test/MC/X86/align-branch-64-1d.s
  llvm/test/MC/X86/align-branch-64-2a.s
  llvm/test/MC/X86/align-branch-64-2b.s
  llvm/test/MC/X86/align-branch-64-2c.s
  llvm/test/MC/X86/align-branch-64-3a.s
  llvm/test/MC/X86/align-branch-64-4a.s
  llvm/test/MC/X86/align-branch-64-5a.s

Index: llvm/test/MC/X86/align-branch-64-5a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-5a.s
@@ -0,0 +1,63 @@
+# Check no nop or prefix is inserted if no branch cross or is against the boundary
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp+indirect+call+ret  %s | llvm-objdump -d  - > %t1
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s | llvm-objdump -d  - > %t2
+# RUN: cmp %t1 %t2
+# RUN: FileCheck --input-file=%t1 %s
+
+# CHECK:  foo:
+# CHECK-NEXT:0: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:3: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:6: 89 d1movl%edx, %ecx
+# CHECK-NEXT:8: 31 c0xorl%eax, %eax
+# CHECK-NEXT:a: 31 c8xorl%ecx, %eax
+# CHECK-NEXT:c: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:f: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   12: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   15: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   18: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   1b: f3 abrep stosl%eax, %es:(%rdi)
+# CHECK-NEXT:   1d: 75 e4jne {{.*}}
+# CHECK-NEXT:   1f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   21: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   24: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   27: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   2a: 89 d1movl%edx, %ecx
+# CHECK-NEXT:   2c: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   2e: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   31: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   34: c1 e9 02 shrl$2, %ecx
+# CHECK-NEXT:   37: f6 c2 02 testb   $2, %dl
+# CHECK-NEXT:   3a: e8 00 00 00 00   callq   {{.*}}
+# CHECK-NEXT:   3f: 31 c0xorl%eax, %eax
+# CHECK-NEXT:   41: 75 e1jne {{.*}}
+
+.text
+.p2align 4,,15
+foo:
+shrl$2, %ecx
+.L1:
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+xorl%ecx, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+rep stosl
+jne.L1
+xorl%eax, %eax
+shrl$2, %ecx
+.L2:
+shrl$2, %ecx
+shrl$2, %ecx
+movl%edx, %ecx
+xorl%eax, %eax
+shrl$2, %ecx
+shrl$2, %ecx
+shrl$2, %ecx
+testb$2, %dl
+callbar
+xorl%eax, %eax
+jne.L2
Index: llvm/test/MC/X86/align-branch-64-4a.s
===
--- /dev/null
+++ llvm/test/MC/X86/align-branch-64-4a.s
@@ -0,0 +1,63 @@
+# Check rets are not aligned with option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp
+# RUN: 

[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-17 Thread James Y Knight via Phabricator via cfe-commits
jyknight added a comment.

> .push_align_branch_boundary [N,] [instruction,]*

I'd like to raise again the possibility of using a more general region 
directive to denote "It is allowable to add prefixes/nops before instructions 
in this region if the assembler wants to", as I'd started discussing in 
https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).

Whether this is OK or not on a particular piece of assembly-code is likely to 
be a generic property of the code, regardless of the purpose of the 
optimization. If we're going to have multiple assembler optimizations that can 
make use of this, it would be nice to express the "OK to pad" "not OK to pad" 
property only once, rather than once for each kind of optimization which might 
make such modifications.

In particular, I'd like to look ahead towards the potential implementation of 
two other features:

1. Allowing the assembler to prefix-pad instructions in order to avoid having 
to emit a NOP for p2align directives.
2. Allowing the assembler to do other instruction-padding performance 
optimizations to avoid other DSB cacheline limits.

To be concrete, I propose:
".autopad", ".noautopad": allow/disallow the assembler to emit padding via 
inserting a nop or prefix before any instruction, as needed.
".align_branch_boundary [N,] [instruction,]": Enable branch-boundary padding 
(per previous description).

In this scheme, I'd generally expect an ".align_branch_boundary" directive to 
be specified once at the beginning of the file, and ".autopad"/".noautopad" 
directives to be sprinkled throughout the file as required.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-17 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

In D70157#1787160 , @MaskRay wrote:

>




> There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). 
> With .push_align_branch/.pop_align_branch, we probably don't need the 
> 'switch-to-default' action.
> 
> I don't know how likely we may ever need nested states (e.g. an `.include` 
> directive inside an .align_branch region where the included file has own idea 
> about branch alignment), but .push/.pop does not seem to be more complex than 
> disable/enable/default.

I rethink about the directives and prefer the .push/.pop pair as @MaskRay 
suggested. To be specified, I'd suggest to use .push_align_branch_boundary and 
.pop_align_branch_boundary to align with MC command line options. They will 
cowork with the command line options and overwrite the options if both are 
existing.

To be clarified, I described the behavior of the directives from my 
understanding. Feel free to speak if you have difference opinion.

.push_align_branch_boundary [N,] [instruction,]*

  This directive specifies the beginning of a region which will overwrite the 
value set by the command line or by the previous directive. It can represent 
either an enabling or disabling directive controlled by parameter N. 
  N indicates to align the branches within N byte boundary. The default value 
is 32. If N is 0, it means the branch alignment is off within this region. 
  Instruction specifies types of branches to align. The value is one or 
multiple values from fused, jcc, jmp, call, ret and indirect. The default value 
is fused, jcc and jmp. (may change later)

.pop_align_branch_boundary

  This directive specifies the end of a region to align branch boundary. The 
status will be back to which was set by the previous directive or the one set 
by the command line if there's no previous directive existing.

I will coordinate with GNU binutils community once we discuss and have 
agreement with the directives.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-17 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1786901 , @reames wrote:

> Noting another issue we found in local testing (with an older version of this 
> patch).  This interacts badly with the implicit exception mechanism in LLVM.  
> For that mechanism, we end up generating assembly which looks more or less 
> like this:
> Ltmp:
>
>   cmp %rsi, (%rdi)
>   jcc 
>   
>
> And a side table which maps TLmp to another label so that a fault at Ltmp can 
> be interpreted as an extremely expensive branch via signal handler.
>
> The problem is that the auto-alignment of the fused branch causes padding to 
> be introduced which separate the label and the faulting exception, breaking 
> the mapping.
>
> Essentially, this comes down to an implicit assumption that the label stays 
> tightly bundled with the following instruction.
>
> This can happen with either nop or prefix padding.


How about insert NOP before the label `Ltmp`?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

In D70157#1787139 , @reames wrote:

> Here are the minutes from our phone call a few minutes ago.


Thanks for coordinating the meeting and having a clear summary. It helps a lot 
to accelerate the patch review. I really appreciate it!

> Annita will refresh current patch with two key changes.  1) Drop prefix 
> support and simplify and 2) drop clang driver support for now.  Desire is to 
> minimize cycle time before next iteration so that feedback on approach can be 
> given while reviewers are still around.

Yes, we are working on it right now. Hopefully we can submit a new patch today 
or tomorrow.

> Philip will prototype directive parsing support.  Annita and Yuo (??) to 
> handle coordination on syntax.

I suppose it's Annita and Fangrui

> Side note to Annita: For you to remove "hard code", you'll have to have a 
> placeholder for the enable/disable interface.  That should probably be split 
> and rebased in my patch.

Let's do it in your directive patch.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

In D70157#1787139 , @reames wrote:

> Here are the minutes from our phone call a few minutes ago.
>
> Attendees: Andy Kaylor, Craig Topper, Annita Zhang, Tom Stellard, Chandler 
> Carruth, Fedor Sergeev, Philip Reames, Yuanake Luo


Thanks for organization the meeting and making the summary.

> Stawman syntax proposal
> 
> .align_branch_boundary disable/default
>  .align_branch_boundary enable N, instructions (fused, jcc, jmp, etc..)
> 
> ...
> 
> Push/Pop semantics were suggested at one point, but were thought to be 
> non-idiomatic?

There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). With 
.push_align_branch/.pop_align_branch, we probably don't need the 
'switch-to-default' action.

I don't know how likely we may ever need nested states (e.g. an `.include` 
directive inside an .align_branch region where the included file has own idea 
about branch alignment), but .push/.pop does not seem to be more complex than 
disable/enable/default.

I confirm that the following 4 commits have been to the binutils-gdb repository 
(https://sourceware.org/ml/binutils/2019-12/msg00138.html 
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=e379e5f385f874adb0b414f917adb1fc50e20de9).

  gas: Add md_generic_table_relax_frag
  i386: Align branches within a fixed boundary
  i386: Add -mbranches-within-32B-boundaries
  i386: Add tests for -malign-branch-boundary and -malign-branch


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

Here are the minutes from our phone call a few minutes ago.

Attendees: Andy Kaylor, Craig Topper, Annita Zhang, Tom Stellard, Chandler 
Carruth, Fedor Sergeev, Philip Reames, Yuanake Luo

Status Summary
==

Performance data has been posted to llvm-dev.  We had a side discussion about 
nop encoding, and it was mentioned these numbers were collected from runs 
targeting skylake (i.e. not generic x86).  This is similar to the result we 
(Azul) have collected and shared summaries of previously.

GNU patch has landed Friday - this mostly fixes assembler command line.

Discussion on Approach
==

Three major options debated:

  Assembler only - as in the current patch, assembler does all work, only 
command line flag
  Explicit Directive only - as proposed in my alternate patch, compiler decides 
exactly what instructions get aligned
  Region based directives - as proposed in James' last comment on review, 
directives enable and disable auto-padding in assembler

Use cases identified:

  compiler users
  important than assembler is self contained (i.e. don't have to know 
compiler options for reproduceability)
  inline assembly looks a lot like assembler users
  legacy assembler
  important that existing assembly works unmodified
  assembler users
  "try it and see" model vs selective enable vs selective disable
  likely need to support all three

Consensus was that the region based directives met use cases the best.  In 
particular, desire to be able to overrule default (for say, inline assembly or 
a JITs patchability assumptions) and then restore default.  Default assembler 
behavior remains unchanged.

Stawman syntax proposal

.align_branch_boundary disable/default
.align_branch_boundary enable N, instructions (fused, jcc, jmp, etc..)

We need to ensure a consensus on syntax is shared w/gnu.  Annita agreed to 
coordinate this.

Compiler would essentially just wrap generated assembly in directives.

Issue noticed while writing this up: proposed syntax assumes a default has been 
set, but doesn't give a way to set one.  This would seem to break the desired 
reproducibility property for compiled code.  Revision needed.

Push/Pop semantics were suggested at one point, but were thought to be 
non-idiomatic?

Other Topics


We very quickly discussed nop vs prefix performance.  There was a clear 
consensus in starting with nop only patch and evolving later as needed.

Next Steps
==

Annita will refresh current patch with two key changes.  1) Drop prefix support 
and simplify and 2) drop clang driver support for now.  Desire is to minimize 
cycle time before next iteration so that feedback on approach can be given 
while reviewers are still around.

Philip will prototype directive parsing support.  Annita and Yuo (??) to handle 
coordination on syntax.

Suggested patch split:

  (current patch) command line option to set default, nop only version 
w/cleaned up code as much as possible
  assembler directive support (draft by Philip in parallel)
  (future) compiler patch to wrap by default

Side note to Annita: For you to remove "hard code", you'll have to have a 
placeholder for the enable/disable interface.  That should probably be split 
and rebased in my patch.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread Kan Shengchen via Phabricator via cfe-commits
skan marked an inline comment as done.
skan added inline comments.



Comment at: llvm/lib/MC/MCFragment.cpp:426
+  case MCFragment::FT_MachineDependent: {
+const MCMachineDependentFragment *MF =
+cast(this);

MaskRay wrote:
> `const auto *`. The type is obvious according to the right hand side.
Shall we keep consistent with the local code style?  `const MCLEBFragment *LF = 
cast(this);` was used here.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread Chandler Carruth via Phabricator via cfe-commits
chandlerc added a comment.

Just wanted to say thanks for the performance data! I know it was hard to get, 
but it is really, really useful to help folks evaluate these kinds of changes 
with actual data around the options available.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

Haven't looked into too many details yet but made some suggestions anyway...




Comment at: llvm/include/llvm/MC/MCFragment.h:634
+
+  uint64_t getBoundarySize() const {
+return AlignBoundarySize;

Store AlignBoundarySize as a shift value, then `needPadding` doesn't even need 
to call Log2_64().



Comment at: llvm/lib/MC/MCFragment.cpp:426
+  case MCFragment::FT_MachineDependent: {
+const MCMachineDependentFragment *MF =
+cast(this);

`const auto *`. The type is obvious according to the right hand side.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:104
+  else if (BranchType == "indirect")
+addKind(AlignBranchIndirect);
+}

An unknown value is just ignored, e.g. `--x86-align-branch=unknown`. I think 
there should be an error, but I haven't looked into the patch detail to 
confidently suggest how we should surface this error.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:185
+  AlignMaxPrefixSize = 5;
+} else {
+  assert((X86AlignBranchBoundary == 0 || (X86AlignBranchBoundary >= 32 &&

> } else {

Should --x86-branches-within-32B-boundaries overwrite 
--x86-align-branch-boundary and --x86-align-branch and 
--x86-align-branch-prefix-size? My feeling is that it just provides a default 
value if either of the three options is not specified.

If you are going to remove `addKind` calls here, you can delete this member 
function.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:196
+}
+if(AlignBoundarySize != 0) {
+  AlignBoundarySize = Align(AlignBoundarySize).value();

space after if

No curly braces around simple statements.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:660
+  for (auto *F = CF; F && F != LastBranch; F = F->getPrevNode()) {
+bool IsMF = F->getKind() == MCFragment::FT_MachineDependent;
+// If there is a fragment that neither has instructions nor is

`isa(F)`




Comment at: llvm/lib/Target/X86/X86InstrInfo.td:1020
 def X86_COND_B   : PatLeaf<(i8 2)>;  // alt. COND_C
-def X86_COND_AE  : PatLeaf<(i8 3)>;  // alt. COND_NC
+def X86_COND_AE  : PatLeaf<(i8 3)>;  // alt. COND_NC,COND_NB
 def X86_COND_E   : PatLeaf<(i8 4)>;  // alt. COND_Z

Unintentional change not part of this patch?



Comment at: llvm/test/MC/X86/align-branch-32-1a.s:1
+# Check the macro-fusion table
+# RUN: llvm-mc -filetype=obj -triple i386-unknown-unknown 
--x86-align-branch-boundary=32 --x86-align-branch=fused+jcc+jmp  
--x86-align-branch-prefix-size=5 %s | llvm-objdump -d - | FileCheck %s

Did an older version include 32-1b.s or 32-1c.s? Now they are missing.



Comment at: llvm/test/MC/X86/align-branch-64-1b.s:1
+# Check only fused conditional jumps and conditional jumps are aligned with 
option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc  
--x86-align-branch-prefix-size=5
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown 
--x86-align-branch-boundary=32 --x86-align-branch=fused+jcc  
--x86-align-branch-prefix-size=5 %s | llvm-objdump -d  - | FileCheck %s

Create test/MC/X86/Inputs/align-branch-64-1.s and reference it from 1[a-d].s 
via %S/Inputs/align-branch-64-1.s


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

Noting another issue we found in local testing (with an older version of this 
patch).  This interacts badly with the implicit exception mechanism in LLVM.  
For that mechanism, we end up generating assembly which looks more or less like 
this:
Ltmp:

  cmp %rsi, (%rdi)
  jcc 

And a side table which maps TLmp to another label so that a fault at Ltmp can 
be interpreted as an extremely expensive branch via signal handler.

The problem is that the auto-alignment of the fused branch causes padding to be 
introduced which separate the label and the faulting exception, breaking the 
mapping.

Essentially, this comes down to an implicit assumption that the label stays 
tightly bundled with the following instruction.

This can happen with either nop or prefix padding.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

In offline discussion, there was an agreement that we needed further 
coordination to make sure this patch moves forward quickly.  For that reason, 
there will be a call happening today at 4pm Pacific.  Interested parties are 
welcome to attend.

Zoom Meeting ID: 507-497-8898
https://azul.zoom.us/j/5074978898

Results of the meeting will be summarized and posted here.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

> More performance data was posted on 
> http://lists.llvm.org/pipermail/llvm-dev/2019-December/137609.html and 
> http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html. Let's 
> move on based on the data.

Based on the SPEC data, we observed 2.6% and 1.3% performance effect in INTRATE 
and FPRATE geomean respectively. Performance effect on individual components 
were observed up to 5.1%.

The tool SW mitigation can recover the geomean to within 99% of the original 
performance with prefix padding to jcc+jmp+fused. The maximum performance loss 
was reduced to within 2.2% of the original one.

The prefix padding can provide better performance as 0.3%~0.5% in geomean than 
nop padding on system with micro update. In individual cases, we observed up to 
1.4% performance improvement in prefix padding. On a system w/o micro update, 
we observed 0.7% better performance of prefix padding on INTRATE geomean.

In this SPEC test, the prefix padding to jcc+jmp+fused and prefix padding to 
all branches has almost the same performance. However, we observed the latter 
prefix padding had a little bit better performance than the previous one in 
some cases at the cost of code size.

Since the performance delta in prefix padding and nop padding is incremental, 
starting from nop padding may be easier to implement as a first step, with 
additional prefix padding options to explore for additional performance 
optimizations.

And since the current update enables a relatively simple and general solution 
to mitigate JCC microcode update (MCU) performance effects to both C/C++ and 
Assembly without any source code change, we recommend it as a starting point. 
Then, the proposal for compiler to generate directives could enable further 
enhancements.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-16 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

In D70157#1768338 , @annita.zhang 
wrote:

> In D70157#1768319 , @chandlerc wrote:
>
> > I'm seeing lots of updates to fix bugs, but no movement for many days on 
> > both my meta comments and (in some ways more importantly) James's meta 
> > comments. (And thanks Philip for chiming in too!)
> >
> > Meanwhile, we really, really need to get this functionality in place. The 
> > entire story for minimizing the new microcode performance hit hinges on 
> > these patches, and I'm really worried by how little progress we're seeing 
> > here.
>
>
> Sorry for belated response. We're working hard to go through some paper work 
> to get the performance data ready. I think maybe it's better to open a 
> mailing thread in llvm-dev to post those performance data and discuss those 
> suggestions.
>
> The first data was posted in 
> http://lists.llvm.org/pipermail/llvm-dev/2019-December/137413.html.
>
> Thanks,
> Annita


More performance data was posted on 
http://lists.llvm.org/pipermail/llvm-dev/2019-December/137609.html and 
http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html. Let's move 
on based on the data.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-11 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.



In D70157#1777272 , @fedor.sergeev 
wrote:

> In D70157#1776424 , @skan wrote:
>
> > > What if I insert explicit align(8) right *after* the sequence?
> >
> > If your insert explicit `.align 8` after the sequence, and the sequence 
> > doesn't has any branch to be aligned, the current solution won't change the 
> > sequence.
>
>
> Well, I kinda figure that from the code behavior right now, however is it 
> really by design or just happens to work now?


It is by design, prefix won't be added to instruction if there is a `align` 
directive int the path to the target branch.

> Seeing that assembler becomes very intelligent now I would rather have a 
> strict guarantee similar to "hardcode" thing that you have here that protects 
> my sequence
>  rather than relying on a fact that in current settings moving my label by 8 
> does not appear to be profitable to assembler.

It depends on what your real need is. If you just need your original implicit 
alignment to work,  turning it into explicit alignment is enough.  Or if you 
need a sequence not changed by
the assembler at all,  maybe we need a new directive such as ".hard " or 
".bundle".


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-10 Thread Fedor Sergeev via Phabricator via cfe-commits
fedor.sergeev added a comment.

In D70157#1776424 , @skan wrote:

> > What if I insert explicit align(8) right *after* the sequence?
>
> If your insert explicit `.align 8` after the sequence, and the sequence 
> doesn't has any branch to be aligned, the current solution won't change the 
> sequence.


Well, I kinda figure that from the code behavior right now, however is it 
really by design or just happens to work now?
Seeing that assembler becomes very intelligent now I would rather have a strict 
guarantee similar to "hardcode" thing that you have here that protects my 
sequence
rather than relying on a fact that in current settings moving my label by 8 
does not appear to be profitable to assembler.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-09 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1775481 , @fedor.sergeev 
wrote:

> In D70157#1775016 , @annita.zhang 
> wrote:
>
> > > The point is that we have explicit requirement at the start and we have a 
> > > lowering into 16-byte sequence that we need to be preserved exactly as it 
> > > is.
> > >  Essentially what we need is  a "protection" for this sequence from any 
> > > changes by machinery that generates the binary code.
> > >  How can we protect a particular byte sequence from being changed by this 
> > > branch aligner?
> >
> > No, in general we can't. The current solution is based on assembler to 
> > insert prefix or nop before the cross (or against) boundary branches. It 
> > can only ensure the explicit alignment specified by directive, but not any 
> > implicit alignment. I don't think any fixup based on assembler can do it. 
> > On the other hand, any code sequence after the alignment directive or even 
> > just in a function has some kind of implicit alignment. It's hard for 
> > assembler to tell which implicit alignment to preserve. The preferred way 
> > is to use explicit alignment directive to specify it.
> >
> > For your scenario, a NOP padding is more controllable. NOP padding will be 
> > inserted just before the branch instructions (or macro fusion branch 
> > instructions). So if there's no branches (or macro fusion branches) in your 
> > code sequence, there will be no NOP inserted.
>
>
> What if I insert explicit align(8) right *after* the sequence?


If your insert explicit `.align 8` after the sequence, and the sequence doesn't 
has any branch to be aligned, the current solution won't change the sequence.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-09 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

I just posted an alternate review (https://reviews.llvm.org/D71238) which 
attempts to carve out a minimum reviewable piece of complexity.  The hope is 
that we can review that one quickly (as there are fewer interacting concerns), 
and then rebase this one (possibly splitting further).

I had previously suggested in review comments that we should reuse the 
infrastructure from .bundle_align_mode.  When I sat down to actually implement 
that, I discovered that the code for that has a bunch of interacting 
assumptions about when fragments are constructed and used vs alignment 
boundaries.  I got a version of this working, but the complexity was worrisome. 
 I now suggest that we should take the rough approach sketched here (a separate 
fragment before the one being aligned), delete the essentially unused bundle 
mode code, and revisit a unified representation if needed for memory density at 
a later time.  (i.e. my previous suggestion wasn't a good one)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-09 Thread Fedor Sergeev via Phabricator via cfe-commits
fedor.sergeev added a comment.

In D70157#1775016 , @annita.zhang 
wrote:

> > The point is that we have explicit requirement at the start and we have a 
> > lowering into 16-byte sequence that we need to be preserved exactly as it 
> > is.
> >  Essentially what we need is  a "protection" for this sequence from any 
> > changes by machinery that generates the binary code.
> >  How can we protect a particular byte sequence from being changed by this 
> > branch aligner?
>
> No, in general we can't. The current solution is based on assembler to insert 
> prefix or nop before the cross (or against) boundary branches. It can only 
> ensure the explicit alignment specified by directive, but not any implicit 
> alignment. I don't think any fixup based on assembler can do it. On the other 
> hand, any code sequence after the alignment directive or even just in a 
> function has some kind of implicit alignment. It's hard for assembler to tell 
> which implicit alignment to preserve. The preferred way is to use explicit 
> alignment directive to specify it.
>
> For your scenario, a NOP padding is more controllable. NOP padding will be 
> inserted just before the branch instructions (or macro fusion branch 
> instructions). So if there's no branches (or macro fusion branches) in your 
> code sequence, there will be no NOP inserted.


What if I insert explicit align(8) right *after* the sequence?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-09 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

> The point is that we have explicit requirement at the start and we have a 
> lowering into 16-byte sequence that we need to be preserved exactly as it is.
>  Essentially what we need is  a "protection" for this sequence from any 
> changes by machinery that generates the binary code.
>  How can we protect a particular byte sequence from being changed by this 
> branch aligner?

No, we can't. The current solution is based on assembler to insert prefix or 
nop before the cross (or against) boundary branches. It can only ensure the 
explicit alignment specified by directive, but not any implicit alignment. I 
don't think any fixup based on assembler can do it. On the other hand, any code 
sequence after the alignment directive or even just in a function has some kind 
of implicit alignment. It's hard for assembler to tell which implicit alignment 
to preserve. The preferred way is to use explicit alignment directive to 
specify it.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-09 Thread Fedor Sergeev via Phabricator via cfe-commits
fedor.sergeev added a comment.

In D70157#1774561 , @skan wrote:

> I could not reproduce the phenomenon that N-byte nop becomes (N+M) bytes with 
> your example. So according to my understanding, I slightly modified your 
> case. (If my understand is wrong, I hope you can point it out :-). )
>
>   .text
>   nop
>   .Ltmp0:
>   .p2align 3, 0x90
>   .rept 16
>   nop
>   .endr
>   .Ltmp3:
>   movl  %eax, -4(%rsp)


In our case it was

  andl $1, %eax

but it does not matter that much.

>   .rept 2
>   nop
>   .endr
>   jmp .Ltmp0
> 
>   The instruction `jmp .Ltmp0` starts at byte 0x1e and ends at byte 0x20.

Again, in our particular case start of the sequence was at xxx8, so 8 + 16(our 
sequence) + 3(andl) + 5(jmp) == 32.

> If we align the jump with preifx, two prefixes will be added to the `.rept2 
> 16 nop .endr`. After prefixes are added, the 16-byte nop becomes 18-byte nop, 
> then the label '.Ltmp3' is not 8-byte aligned any more.

Yes, thats what happened.

> I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, 
> since the alignment is not explicitly required.

The point is that we have explicit requirement at the start and we have a 
lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is  a "protection" for this sequence from any changes 
by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch 
aligner?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-08 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1771771 , @reames wrote:

> We uncovered another functional issue with this patch, or at least, the 
> interaction of this patch and other parts of LLVM.  In our support for 
> STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions 
> of code which might be patched out.  It's important that these regions are 
> exactly N bytes as concurrent patching which doesn't replace an integral 
> number of instructions is ill-defined on X86-64.  This patch causes the 
> N-byte nop sequence to sometimes become (N+M) bytes which breaks the 
> patching.  I believe that the XRAY support may have a similar issue.
>
> More generally, I'm worried about the legality of arbitrarily prefixing 
> instructions from unknown sources.  In the particular example we saw, we had 
> something along the following:
>
> .Ltmp0:
>
>   .p2align3, 0x90
>   (16 byte nop sequence)
>
> .Ltmp3:
>
>   jmp *%rax
>
>
> In addition to the patching legality issue above, padding the nop sequence 
> does something else interesting in this example.  It changes the alignment of 
> Ltmp3.  Before, Ltmp3 was always 8 byte aligned, after prefixes are added, 
> it's not.  It's not clear to me exactly what the required semantics here are, 
> but we at least had been assuming the alignment of Ltmp3 was guaranteed in 
> this case.  (That's actually how we found the patching issue.)


I could not reproduce the phenomenon that N-byte nop becomes (N+M) bytes with 
your example. So according to my understanding, I slightly modified your case. 
(If my understand is wrong, I hope you can point it out :-). )

  .text
  nop
  .Ltmp0:
  .p2align 3, 0x90
  .rept 16
  nop
  .endr
  .Ltmp3:
  movl  %eax, -4(%rsp)
  .rept 2
  nop
  .endr
  jmp .Ltmp0

The instruction `jmp .Ltmp0` starts at byte 0x1e and ends at byte 0x20. If we 
align the jump with preifx, two prefixes will be added to the `.rept2 16 nop 
.endr`. After prefixes are added, the 16-byte nop becomes 18-byte nop, then the 
label '.Ltmp3' is not 8-byte aligned any more.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since 
the alignment is not explicitly required.  For example, I think we can not 
assuming the instruction `jmp .Ltmp0` always starts at byte 0x1e. And in fact, 
as long as the binary generated by the compiler does not have a one-to-one 
correspondence with the assembly code, at least one implict assumption of an 
instruction will be broken.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-07 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.



In D70157#1760278 , @jyknight wrote:

>




> Branch alignment
> 
>  ---
> 
> The primary goal of this patch, restricting the placement of branch 
> instructions, is a performance optimization. Similar to loop alignment, the 
> desire is to increase speed, at the cost of code-size. However, the way this 
> feature has been designed is a global assembler flag. I find that not ideal, 
> because it cannot take into account hotness of a block/function, as for 
> example loop alignment code does. Basic-block alignment of loops is 
> explicitly opt-in on an block-by-block basis -- the compiler simply emits a 
> p2align directive where it needs, and the assembler honors that. And so, 
> MachineBlockPlacement::alignBlocks has a bunch of conditions under which it 
> will avoid emitting a p2align. This seems like a good model -- the assembler 
> does what it's told by the compiler (or assembly-writer). Making the 
> branch-instruction-alignment work similarly seems like it would be good.
> 
> IMO it would be nicest if there could be a directive that requests to 
> specially-align the next instruction. However, the fused-jcc case makes that 
> quite tricky, so perhaps this ought to also be a mode which can be 
> enabled/disabled on a region as well.

Yes, the primary goal of this patch is a performance optimization or 
mitigation. The intention is to provide a simple method for users to mitigate 
the performance impact of JCC MCU with less effort. We also provide users 
several options to tune the performance. But the basic idea is to make it easy 
for users to mitigate it and improve the performance.

Your proposal is a good idea. But I'm afraid it may not cover all the 
scenarios. Firstly, the proposal replies on compiler to detect the hotspots. 
But the compiler needs LTO and/or PGO to get the precise hot spots. Otherwise, 
if the compiler misses the hot spots which impact the application performance, 
the users have no way but have to insert the directives manually. Secondly, for 
the existing codes written in assembly, the compiler can't handle it. The users 
have to insert the directives by hand, which are pretty much work.

I think the current patch wants to give a simple and general solution to 
mitigate JCC MCU performance impact to both C/C++ and Assembly. And it doesn't 
need any source code change. I think your proposal will be a good enhancement 
on top of it.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-06 Thread LuoYuanke via Phabricator via cfe-commits
LuoYuanke added a comment.

In D70157#1773180 , @reames wrote:

> Recording something so I don't forget it when we get back to the prefix 
> padding version.  The write up on the bundle align mode stuff mentions a 
> concerning memory overhead for the feature.  Since the basic implementation 
> techniques are similar, we need to make sure we assess the memory overhead of 
> the prefix padding implementation.  See 
> https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm 
> for context.  I don't believe this is likely to be an issue for the nop 
> padding variant.


From the doc of 
https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm 
the ".bundle_align_mode" ensure each instruction that following the directive 
don't cross the alignment boundary and ensure the first instruction following 
the directive be aligned,  but in this patch we only require branch (jcc, jmp, 
...) instruction don't cross the alignment boundary. Another remind, this patch 
avoid branch instruction hit the alignment boundary, and bundle syntax doesn't 
support that (see section 2.1 of 
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf).
 So I don't think bundle syntax fit current requirement perfectly.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-06 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

Recording something so I don't forget it when we get back to the prefix padding 
version.  The write up on the bundle align mode stuff mentions a concerning 
memory overhead for the feature.  Since the basic implementation techniques are 
similar, we need to make sure we assess the memory overhead of the prefix 
padding implementation.  See 
https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm 
for context.  I don't believe this is likely to be an issue for the nop padding 
variant.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-06 Thread David Zarzycki via Phabricator via cfe-commits
davezarzycki added a comment.

In D70157#1772227 , @annita.zhang 
wrote:

> > 
> > 
> >>> Third, I have not see a justification for why complexity for instruction 
> >>> prefix padding is necessary.  All the effected CPUs support multi-byte 
> >>> nops, so we're talking about a *single micro op* difference between the 
> >>> nop form and prefix form.  Can anyone point to a performance delta due to 
> >>> this?  If not, I'd suggest we should start with the nop form, and then 
> >>> build the prefix form in a generic manner for all alignment varieties.
> >> 
> >> +1.
> > 
> > +1. Starting from just NOP padding sounds a simple and good first step. We 
> > can explore segment override prefixes in the future.
>
> I think it's a good suggestion to start with NOP padding as the first step. 
> In our previous experiment, we saw that the prefix padding was slight better 
> than NOP padding, but not much. We will retest the NOP padding and go back to 
> you.


For whatever it may be worth: Agnor Fog's empirical research on x86 pipelines 
and his review of manufacturer optimization guidelines also concludes that 
prefixes are often preferable to NOPs on modern x86 processors. (See: 
https://www.agner.org/optimize/microarchitecture.pdf) This arguably isn't 
surprising given that the decoder needs to be good at finding instruction 
boundaries but the decoder isn't responsible for interpreting instructions, 
therefore NOPs of any size dilute decode bandwidth.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-06 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

> 
> 
>>> Third, I have not see a justification for why complexity for instruction 
>>> prefix padding is necessary.  All the effected CPUs support multi-byte 
>>> nops, so we're talking about a *single micro op* difference between the nop 
>>> form and prefix form.  Can anyone point to a performance delta due to this? 
>>>  If not, I'd suggest we should start with the nop form, and then build the 
>>> prefix form in a generic manner for all alignment varieties.
>> 
>> +1.
> 
> +1. Starting from just NOP padding sounds a simple and good first step. We 
> can explore segment override prefixes in the future.

I think it's a good suggestion to start with NOP padding as the first step. In 
our previous experiment, we saw that the prefix padding was slight better than 
NOP padding, but not much. We will retest the NOP padding and go back to you.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

In D70157#1772019 , @reames wrote:

> In D70157#1771841 , @jyknight wrote:
>
> > In D70157#1771832 , @reames wrote:
> >
> > > I've been digging through the code for this for the last day or so.  This 
> > > is a new area for me, so it's possible I'm off base, but I have some 
> > > concerns about the current design.
> > >
> > > First, there appears to already be support for instruction bundling and 
> > > alignment in the assembler today.  I stumbled across the 
> > > .bundle_align_mode, .bundle_start, and .bundle_end mechanism 
> > > (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) 
> > > which seems to *heavily* overlap with this proposal.  I suspect that the 
> > > compiler support suggested by James and myself earlier in this thread 
> > > could be implemented on to this existing mechanism.
> > >
> > > Second, the new callbacks and infrastructure added appear to overlap 
> > > heavily w/the MCCodePadding infrastructure.  (Which, admitted, appears 
> > > unused and untested.)
> >
> >
> > My conclusion after looking at all of that was actually that I plan to 
> > propose removing both the MCCodePadding and all the bundle-padding 
> > infrastructure, not add new stuff on top of it -- the former is unused, and 
> > I believe the latter is only for Chrome's NaCL, which is deprecated, and 
> > fairly close to being removed. If we need something similar in the future, 
> > we should certainly look to both of those for inspiration, but I don't 
> > think we need to be constrained by them.
>
>
> I can definitely see removing the code padding stuff, since it's unused and 
> untested.
>
> As for the bundle mechanisms, why?  It seems like exactly what we're going to 
> want here.  Regardless of the auto-detect feature, we're going to need a 
> representation of a bundle which needs to be properly placed to avoid 
> splitting, and the current code does that.  Why not reuse the, presumable 
> reasonable well tested, existing infrastructure?   The only extra thing we 
> seem to need is the ability to toggle off bundle formation for instruction 
> types we don't care about.  Since we're going to need an assembler spelling 
> of that regardless, it seems like the current code is a decent baseline?


I created D71106  to delete MCCodePadder and 
accompanying classes.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

In D70157#1771841 , @jyknight wrote:

> In D70157#1771832 , @reames wrote:
>
> > I've been digging through the code for this for the last day or so.  This 
> > is a new area for me, so it's possible I'm off base, but I have some 
> > concerns about the current design.
> >
> > First, there appears to already be support for instruction bundling and 
> > alignment in the assembler today.  I stumbled across the 
> > .bundle_align_mode, .bundle_start, and .bundle_end mechanism 
> > (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which 
> > seems to *heavily* overlap with this proposal.  I suspect that the compiler 
> > support suggested by James and myself earlier in this thread could be 
> > implemented on to this existing mechanism.
> >
> > Second, the new callbacks and infrastructure added appear to overlap 
> > heavily w/the MCCodePadding infrastructure.  (Which, admitted, appears 
> > unused and untested.)
>
>
> My conclusion after looking at all of that was actually that I plan to 
> propose removing both the MCCodePadding and all the bundle-padding 
> infrastructure, not add new stuff on top of it -- the former is unused, and I 
> believe the latter is only for Chrome's NaCL, which is deprecated, and fairly 
> close to being removed. If we need something similar in the future, we should 
> certainly look to both of those for inspiration, but I don't think we need to 
> be constrained by them.


I can definitely see removing the code padding stuff, since it's unused and 
untested.

As for the bundle mechanisms, why?  It seems like exactly what we're going to 
want here.  Regardless of the auto-detect feature, we're going to need a 
representation of a bundle which needs to be properly placed to avoid 
splitting, and the current code does that.  Why not reuse the, presumable 
reasonable well tested, existing infrastructure?   The only extra thing we seem 
to need is the ability to toggle off bundle formation for instruction types we 
don't care about.  Since we're going to need an assembler spelling of that 
regardless, it seems like the current code is a decent baseline?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a subscriber: opaparo.
MaskRay added a comment.

In D70157#1771841 , @jyknight wrote:

> In D70157#1771832 , @reames wrote:
>
> > I've been digging through the code for this for the last day or so.  This 
> > is a new area for me, so it's possible I'm off base, but I have some 
> > concerns about the current design.
> >
> > First, there appears to already be support for instruction bundling and 
> > alignment in the assembler today.  I stumbled across the 
> > .bundle_align_mode, .bundle_start, and .bundle_end mechanism 
> > (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which 
> > seems to *heavily* overlap with this proposal.  I suspect that the compiler 
> > support suggested by James and myself earlier in this thread could be 
> > implemented on to this existing mechanism.
> >
> > Second, the new callbacks and infrastructure added appear to overlap 
> > heavily w/the MCCodePadding infrastructure.  (Which, admitted, appears 
> > unused and untested.)
>
>
> My conclusion after looking at all of that was actually that I plan to 
> propose removing both the MCCodePadding and all the bundle-padding 
> infrastructure, not add new stuff on top of it -- the former is unused, and I 
> believe the latter is only for Chrome's NaCL, which is deprecated, and fairly 
> close to being removed. If we need something similar in the future, we should 
> certainly look to both of those for inspiration, but I don't think we need to 
> be constrained by them.


CC the author of D34393  - @opaparo for 
MCCodePadding. Intel folks may know how to contact @opaparo?

I also noticed that MCCodePadder.cpp is never updated (except a license change) 
after the initial check-in.

>> Third, I have not see a justification for why complexity for instruction 
>> prefix padding is necessary.  All the effected CPUs support multi-byte nops, 
>> so we're talking about a *single micro op* difference between the nop form 
>> and prefix form.  Can anyone point to a performance delta due to this?  If 
>> not, I'd suggest we should start with the nop form, and then build the 
>> prefix form in a generic manner for all alignment varieties.
> 
> +1.

+1. Starting from just NOP padding sounds a simple and good first step. We can 
explore segment override prefixes in the future.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread James Y Knight via Phabricator via cfe-commits
jyknight added a comment.

In D70157#1771832 , @reames wrote:

> I've been digging through the code for this for the last day or so.  This is 
> a new area for me, so it's possible I'm off base, but I have some concerns 
> about the current design.
>
> First, there appears to already be support for instruction bundling and 
> alignment in the assembler today.  I stumbled across the .bundle_align_mode, 
> .bundle_start, and .bundle_end mechanism 
> (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which 
> seems to *heavily* overlap with this proposal.  I suspect that the compiler 
> support suggested by James and myself earlier in this thread could be 
> implemented on to this existing mechanism.
>
> Second, the new callbacks and infrastructure added appear to overlap heavily 
> w/the MCCodePadding infrastructure.  (Which, admitted, appears unused and 
> untested.)


My conclusion after looking at all of that was actually that I plan to propose 
removing both the MCCodePadding and all the bundle-padding infrastructure, not 
add new stuff on top of it -- the former is unused, and I believe the latter is 
only for Chrome's NaCL, which is deprecated, and fairly close to being removed. 
If we need something similar in the future, we should certainly look to both of 
those for inspiration, but I don't think we need to be constrained by them.

> Third, I have not see a justification for why complexity for instruction 
> prefix padding is necessary.  All the effected CPUs support multi-byte nops, 
> so we're talking about a *single micro op* difference between the nop form 
> and prefix form.  Can anyone point to a performance delta due to this?  If 
> not, I'd suggest we should start with the nop form, and then build the prefix 
> form in a generic manner for all alignment varieties.

+1.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

I've been digging through the code for this for the last day or so.  This is a 
new area for me, so it's possible I'm off base, but I have some concerns about 
the current design.

First, there appears to already be support for instruction bundling and 
alignment in the assembler today.  I stumbled across the .bundle_align_mode, 
.bundle_start, and .bundle_end mechanism 
(https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which 
seems to *heavily* overlap with this proposal.  I suspect that the compiler 
support suggested by James and myself earlier in this thread could be 
implemented on to this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily 
w/the MCCodePadding infrastructure.  (Which, admitted, appears unused and 
untested.)

Third, I have not see a justification for why complexity for instruction prefix 
padding is necessary.  All the effected CPUs support multi-byte nops, so we're 
talking about a *single micro op* difference between the nop form and prefix 
form.  Can anyone point to a performance delta due to this?  If not, I'd 
suggest we should start with the nop form, and then build the prefix form in a 
generic manner for all alignment varieties.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

We uncovered another functional issue with this patch, or at least, the 
interaction of this patch and other parts of LLVM.  In our support for 
STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of 
code which might be patched out.  It's important that these regions are exactly 
N bytes as concurrent patching which doesn't replace an integral number of 
instructions is ill-defined on X86-64.  This patch causes the N-byte nop 
sequence to sometimes become (N+M) bytes which breaks the patching.  I believe 
that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing 
instructions from unknown sources.  In the particular example we saw, we had 
something along the following:

.Ltmp0:
.p2align3, 0x90
(16 byte nop sequence)
.Ltmp3:
jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does 
something else interesting in this example.  It changes the alignment of Ltmp3. 
 Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not.  
It's not clear to me exactly what the required semantics here are, but we at 
least had been assuming the alignment of Ltmp3 was guaranteed in this case.  
(That's actually how we found the patching issue.)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Fedor Sergeev via Phabricator via cfe-commits
fedor.sergeev added a comment.

In D70157#1768310 , @skan wrote:

> fix the bug
>
> https://bugs.llvm.org/show_bug.cgi?id=44215


FYI: I did close the bug as fixed after verifying that the fix works for me.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Kan Shengchen via Phabricator via cfe-commits
skan marked 3 inline comments as done.
skan added inline comments.



Comment at: llvm/test/MC/X86/x86-64-align-branch-1b.s:10
+# CHECK:  foo:
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:8: 2e 55pushq   %rbp

MaskRay wrote:
> I think 1a.s and 1b.s should be merged. FileCheck supports
> 
> --check-prefixes=CHECK,PREFIX5
> --check-prefixes=CHECK,PREFIX1
> 
> ```
> CHECK: common part
> CHECK-NEXT: common part
> PREFIX5:
> PREFIX5-NEXT:
> PREFIX1:
> PREFIX1-NEXT:
> CHECK:
> CHECK-NEXT:
> ```
> 
> ```
> % diff -U1 x86-64-align-branch-1[ab].s
>  # CHECK:  foo:
> -# CHECK-NEXT:0: 64 64 64 64 89 04 25 01 00 00 00 movl%eax, %fs:1
> -# CHECK-NEXT:b: 55   pushq   %rbp
> -# CHECK-NEXT:c: 55   pushq   %rbp
> -# CHECK-NEXT:d: 55   pushq   %rbp
> +# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
> +# CHECK-NEXT:8: 2e 55pushq   %rbp
> +# CHECK-NEXT:a: 2e 55pushq   %rbp
> +# CHECK-NEXT:c: 2e 55pushq   %rbp
>  # CHECK-NEXT:e: 48 89 e5 movq%rsp, %rbp
> ```
> 
> Is there performance benefit to add 4 prefixes to the same instruction?
Thanks for the knowledge!  There is no performance benefit to add 4 prefixes to 
the same instruction. The  instruction `movl%eax, %fs:1` has already has a 
prefix %fs (0x64),  if option `-x86-align-branch-prefix-size=5` is used, we 
could add 4 more prefixes at most. The branch to be aligned needs 3-byte 
padding, so 3 prefixes are added to the move instruction.  



Comment at: llvm/test/MC/X86/x86-64-align-branch-1c.s:2
+# Check only fused conditional jumps and conditional jumps are aligned with 
option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc  
--x86-align-branch-prefix-size=5
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown 
--x86-align-branch-boundary=32 --x86-align-branch=fused+jcc  
--x86-align-branch-prefix-size=5 %s | llvm-objdump -d  - | FileCheck %s
+

MaskRay wrote:
> The difference between 1a and 1c is that 1c does not allow "jmp", but in 1a 
> no jmp instructions get a prefix in the test, so it is unclear why 1c has 
> different output.
```
# CHECK-NEXT:   45: 2e 89 45 fc  movl%eax, 
%cs:-4(%rbp)
# CHECK-NEXT:   49: 89 75 f4 movl%esi, -12(%rbp)
# CHECK-NEXT:   4c: 89 7d f8 movl%edi, -8(%rbp)
# CHECK-NEXT:   4f: 89 75 f4 movl%esi, -12(%rbp)
# CHECK-NEXT:   52: 89 75 f4 movl%esi, -12(%rbp)
# CHECK-NEXT:   55: 89 75 f4 movl%esi, -12(%rbp)
# CHECK-NEXT:   58: 89 75 f4 movl%esi, -12(%rbp)
# CHECK-NEXT:   5b: 89 75 f4 movl%esi, -12(%rbp)
# CHECK-NEXT:   5e: 5d   popq%rbp
# CHECK-NEXT:   5f: 5d   popq%rbp
# CHECK-NEXT:   60: eb 26jmp {{.*}}
```

The prefix 2e is added at

```
 45: 2e 89 45 fc  movl%eax, %cs:-4(%rbp)
```



Comment at: llvm/test/MC/X86/x86-64-align-branch-1e.s:46
+# CHECK-NEXT:   5c: eb 27jmp {{.*}}
+# CHECK-NEXT:   5e: 90   nop
+# CHECK-NEXT:   5f: 90   nop

MaskRay wrote:
> This is weird. Comparing this with 1d, 1e allows more instruction types, yet 
> it inserts two NOPs which actually seems to degrade performance.
Not all the target cpu support long nop, you can find related code in 
X86AsmBackend::writeNopData.



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-05 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1769932 , @MaskRay wrote:

> I find another deficiency (infinite loop) with the current approach.
>
> Say, there is a `je 0` (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. 
> (0x90+6)%32 == 0, so it ends on a 32-byte boundary.
>  MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may 
> vary from 0 to 4.
>  If there are other MCMachineDependentFragment's in the section, some may 
> shrink while some may expand.
>  In some cases the following loop will not converge
>
>   bool MCAssembler::layoutOnce(MCAsmLayout ) {
> ++stats::RelaxationSteps;
>  
> bool WasRelaxed = false;
> for (iterator it = begin(), ie = end(); it != ie; ++it) {
>   MCSection  = *it;
>   while (layoutSectionOnce(Layout, Sec)) ///
> WasRelaxed = true;
> }
>  
> return WasRelaxed;
>   }
>   
>   // In MCAssembler::layoutSectionOnce,
> case MCFragment::FT_MachineDependent:
>   RelaxedFrag =
>   relaxMachineDependent(Layout, *cast(I));
>   break;
>   
>
> To give a concrete example, `clang++ -fsanitize=memory 
> compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries` does 
> not converge. You may also try dtor-*.cpp in that directory.
>
> A simple iterative algorithm is not guaranteed to converge. We probably can 
> solve the layout problem with a dynamic programming algorithm:
>
> f[i][j] = the minimum cost that layouts the first i instructions with j extra 
> bytes (via NOPs or prefixes)
> or
>  g[i][j] = the minimum inserted bytes that layouts the first i instructions 
> with cost j
>
> I am not clear which one is better. A simple greedy approach is to set an 
> upper limit on the number of iterations when the section contains at least 
> one MCMachineDependentFragment, i.e.
>
>   bool HasMCMachineDependentFragment = false;
>   int count = 5; // arbitrary. Please find an appropriate value.
>   while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && 
> count > 0) {
> WasRelaxed = true;
> if (HasMCMachineDependentFragment)
>   count--;
>   }


I guess you originally wanted to say (90 + 6)%32 == 0 instead of (0x90 +6)%32 
== 0. With the last patch, the command

`clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp 
-mbranches-within-32B-boundaries` does not converge, but this is not the fault 
of this simple iterative algorithm.

Let me briefly introduce this algorithm. Regardless of whether the prefix or 
nop is filled in, `MCMachineDependentFragment` is used to occupy a certain size 
of space to align the branch, and it has a pointer to the branch it is 
responsible for. Multiple MCMachineDependentFragments can work together to 
align a branch. For example, there are two MCMachineDependentFragment, M1 
 and M2  to align 
branch J1 . In the first iteration, J1 
 needs 7 bytes of padding, so the size of M1 
 will grow to 7 as much as possible. However, M1 
 can only grow to 5 or smaller from some reason. 
At this time, M2  will grow as much as possible to 
2. In the second iteration, when it's M1 's turn 
to relax, we will subtract the size of M1  and M2 
 from the current address of J1 
 to get the size that J1 
 needs to padding, assuming it is 6, and then M1 
 will keep the size to 5 and the size of M2 
 will be reduced to 1. It is not difficult to see 
that among the MCFragment from M1  to J1 
, as long as the size of MCFragment other than M1 
 and M2  is fixed 
within a limited number of iterations, the sizes of M1 
 and M2  will be 
fixed within a limited number of iterations, that is, the iteration will 
converge.

As far as I know, the `MCFragment` used to store instructions includes 
`MCDataFragment`, `MCRelaxableFragment` and `MCCompactEncodedFragment`.  
`MCDataFragment` and `MCCompactEncodedFragment` have fixed sizes. 
`MCRelaxableFragment` is used to store instructions of variable size and will 
only grow and not shrink, as a result, its size will be fixed within a limited 
number of cycles . So as long as we ensure that there is only MCFragment that 
stores instructions between the  instruction to be prefixed and the branch to 
be aligned, this iterative algorithm will converge.

The file cxa_atexit.s  has more than one text section. The instruction to be 
prefixed and the branch to align may be in two different text sections. I 
forgot to check if there is a 

[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-04 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added inline comments.



Comment at: llvm/lib/MC/MCAssembler.cpp:1014
+  unsigned EndAddr = StartAddr + Size;
+  return StartAddr / BoundarySize != ((EndAddr - 1) / BoundarySize);
+}

Division is slow. Pass in the power of 2 and use right shift instead.

You may change `MCMachineDependentFragment::AlignBoundarySize` (a power of 2) 
to a power.



Comment at: llvm/lib/MC/MCAssembler.cpp:1022
+  unsigned EndAddr = StartAddr + Size;
+  return EndAddr % BoundarySize == 0;
+}

Ditto.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-04 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

I find another deficiency (infinite loop) with the current approach.

Say, there is a `je 0` (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. 
(0x90+6)%32 == 0, so it ends on a 32-byte boundary.
MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may vary 
from 0 to 4.
If there are other MCMachineDependentFragment's in the section, some may shrink 
while some may expand.
In some cases the following loop will not converge

  bool MCAssembler::layoutOnce(MCAsmLayout ) {
++stats::RelaxationSteps;
  
bool WasRelaxed = false;
for (iterator it = begin(), ie = end(); it != ie; ++it) {
  MCSection  = *it;
  while (layoutSectionOnce(Layout, Sec)) ///
WasRelaxed = true;
}
  
return WasRelaxed;
  }
  
  // In MCAssembler::layoutSectionOnce,
case MCFragment::FT_MachineDependent:
  RelaxedFrag =
  relaxMachineDependent(Layout, *cast(I));
  break;

To give a concrete example, `clang++ -fsanitize=memory 
compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries` does not 
converge. You may also try dtor-*.cpp in that directory.

A simple iterative algorithm is not guaranteed to converge. We probably can 
solve the layout problem with a dynamic programming algorithm:

f[i][j] = the minimum cost that layouts the first i instructions with j extra 
bytes (via NOPs or prefixes)
or
g[i][j] = the minimum inserted bytes that layouts the first i instructions with 
cost j

I am not clear which one is better. A simple greedy approach is to set an upper 
limit on the number of iterations when the section contains at least one 
MCMachineDependentFragment, i.e.

  bool HasMCMachineDependentFragment = false;
  int count = 5; // arbitrary. Please find an appropriate value.
  while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && count 
> 0) {
WasRelaxed = true;
if (HasMCMachineDependentFragment)
  count--;
  }


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-04 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

I am still trying to understand the patch. Just made some comments about the 
tests.




Comment at: llvm/include/llvm/MC/MCFragment.h:663
+  enum SubType : uint8_t {
+// BranchPadding - The variable size fragment to insert NOP before branch.
+BranchPadding,

Don’t duplicate function or class name at the beginning of the comment 
(`BranchPadding - `). (ref: 
https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments)



Comment at: llvm/include/llvm/MC/MCFragment.h:674
+FusedJccPadding,
+// HardCodeBegin - The zero size fragment to mark the begin of the sequence
+// of hard code

Full stop.



Comment at: llvm/include/llvm/MC/MCFragment.h:709
+switch (SubKind) {
+default:
+  llvm_unreachable("Unknown subtype of MCMachineDependentFragment");

Move llvm_unreachable below the switch, otherwise clang will give a warning:

  warning: default label in switch which covers all enumera
tion values [-Wcovered-switch-default]

Unfortunately all GCC (even 9) -Wall will warn `warning: control reaches end of 
non-void function [-Wreturn-type]` unless you place an unreachable statement.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:537
+  // Linker may rewrite the instruction with variant symbol operand.
+  if(hasVariantSymbol(Inst)) return false;
+

Space after `if`



Comment at: llvm/test/MC/X86/x86-64-align-branch-1a.s:1
+# Check option --x86-branches-within-32B-boundaries is equivalent to the 
combination of options --x86-align-branch-boundary=32 
--x86-align-branch=fused+jcc+jmp  --x86-align-branch-prefix-size=5
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown 
--x86-branches-within-32B-boundaries %s | llvm-objdump -d  - > %t

1a~1g use the same source file. Move the source to `Inputs/align-branch-1-64.s`

According to the local naming convention, this test should probably be renamed 
to `align-branch-1-64.s`



Comment at: llvm/test/MC/X86/x86-64-align-branch-1a.s:7
+
+# CHECK: file format {{.*}}
+

Delete



Comment at: llvm/test/MC/X86/x86-64-align-branch-1a.s:11
+
+# CHECK: Disassembly of section .text:
+# CHECK:  foo:

Delete `Disassembly of section .text:`. Ditto below.



Comment at: llvm/test/MC/X86/x86-64-align-branch-1b.s:4
+
+# CHECK: file format {{.*}}
+

Delete



Comment at: llvm/test/MC/X86/x86-64-align-branch-1b.s:10
+# CHECK:  foo:
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:8: 2e 55pushq   %rbp

I think 1a.s and 1b.s should be merged. FileCheck supports

--check-prefixes=CHECK,PREFIX5
--check-prefixes=CHECK,PREFIX1

```
CHECK: common part
CHECK-NEXT: common part
PREFIX5:
PREFIX5-NEXT:
PREFIX1:
PREFIX1-NEXT:
CHECK:
CHECK-NEXT:
```

```
% diff -U1 x86-64-align-branch-1[ab].s
 # CHECK:  foo:
-# CHECK-NEXT:0: 64 64 64 64 89 04 25 01 00 00 00 movl%eax, %fs:1
-# CHECK-NEXT:b: 55   pushq   %rbp
-# CHECK-NEXT:c: 55   pushq   %rbp
-# CHECK-NEXT:d: 55   pushq   %rbp
+# CHECK-NEXT:0: 64 89 04 25 01 00 00 00  movl%eax, %fs:1
+# CHECK-NEXT:8: 2e 55pushq   %rbp
+# CHECK-NEXT:a: 2e 55pushq   %rbp
+# CHECK-NEXT:c: 2e 55pushq   %rbp
 # CHECK-NEXT:e: 48 89 e5 movq%rsp, %rbp
```

Is there performance benefit to add 4 prefixes to the same instruction?



Comment at: llvm/test/MC/X86/x86-64-align-branch-1c.s:2
+# Check only fused conditional jumps and conditional jumps are aligned with 
option --x86-align-branch-boundary=32 --x86-align-branch=fused+jcc  
--x86-align-branch-prefix-size=5
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown 
--x86-align-branch-boundary=32 --x86-align-branch=fused+jcc  
--x86-align-branch-prefix-size=5 %s | llvm-objdump -d  - | FileCheck %s
+

The difference between 1a and 1c is that 1c does not allow "jmp", but in 1a no 
jmp instructions get a prefix in the test, so it is unclear why 1c has 
different output.



Comment at: llvm/test/MC/X86/x86-64-align-branch-1d.s:4
+
+# CHECK: file format {{.*}}
+

Delete



Comment at: llvm/test/MC/X86/x86-64-align-branch-1e.s:46
+# CHECK-NEXT:   5c: eb 27jmp {{.*}}
+# CHECK-NEXT:   5e: 90   nop
+# CHECK-NEXT:   5f: 90   nop

This is weird. 

[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-04 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1768389 , @craig.topper 
wrote:

> Can you please put the macro fusion changes in a separate phabricator review. 
> I’ll review it in the morning US time and if it all looks good we can get 
> that part committed while the other comments are being addressed.


Sure.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-04 Thread Craig Topper via Phabricator via cfe-commits
craig.topper added a comment.

Can you please put the macro fusion changes in a separate phabricator review. 
I’ll review it in the morning US time and if it all looks good we can get that 
part committed while the other comments are being addressed.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-04 Thread Kan Shengchen via Phabricator via cfe-commits
skan marked an inline comment as done.
skan added inline comments.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h:158
+case X86::AND16i16:
+case X86::AND16mr:
+case X86::AND16ri:

craig.topper wrote:
> None of the AND/ADD/SUB instructions ending in mr are eligible for 
> macrofusion as far as I know. Those all involve a load and a store which is 
> not supported by macrofusion.
> 
> We also lost all the ADD*_DB instructions from the macrofusion list. I 
> believe they are in the existing list incorrectly. So removing them is 
> correct, but as far as I can see that change was not mentioned in the 
> description of this patch.
> 
> Can we split the macrofusion refactoring out of this patch so we can review 
> it separately and hopefully get it committed sooner than the other review 
> feedback.
Okay,  I will upload another patch to correct the macrofusion table as soon as 
possible.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-03 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

In D70157#1768319 , @chandlerc wrote:

> I'm seeing lots of updates to fix bugs, but no movement for many days on both 
> my meta comments and (in some ways more importantly) James's meta comments. 
> (And thanks Philip for chiming in too!)
>
> Meanwhile, we really, really need to get this functionality in place. The 
> entire story for minimizing the new microcode performance hit hinges on these 
> patches, and I'm really worried by how little progress we're seeing here.


Sorry for belated response. We're working hard to go through some paper work to 
get the performance data ready. I think maybe it's better to open a mailing 
thread in llvm-dev to post those performance data and discuss those suggestions.

Thanks,
Annita


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-03 Thread Craig Topper via Phabricator via cfe-commits
craig.topper added inline comments.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h:158
+case X86::AND16i16:
+case X86::AND16mr:
+case X86::AND16ri:

None of the AND/ADD/SUB instructions ending in mr are eligible for macrofusion 
as far as I know. Those all involve a load and a store which is not supported 
by macrofusion.

We also lost all the ADD*_DB instructions from the macrofusion list. I believe 
they are in the existing list incorrectly. So removing them is correct, but as 
far as I can see that change was not mentioned in the description of this patch.

Can we split the macrofusion refactoring out of this patch so we can review it 
separately and hopefully get it committed sooner than the other review feedback.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-03 Thread Chandler Carruth via Phabricator via cfe-commits
chandlerc added a comment.

I'm seeing lots of updates to fix bugs, but no movement for many days on both 
my meta comments and (in some ways more importantly) James's meta comments. 
(And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The 
entire story for minimizing the new microcode performance hit hinges on these 
patches, and I'm really worried by how little progress we're seeing here.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-03 Thread Fedor Sergeev via Phabricator via cfe-commits
fedor.sergeev requested changes to this revision.
fedor.sergeev added a comment.
This revision now requires changes to proceed.

In D70157#1767212 , @fedor.sergeev 
wrote:

> Working on getting upstream reproducer


Hangs on a moderately small piece  (~150 lines) of x86 assembly, filed a bug on 
it here:

https://bugs.llvm.org/show_bug.cgi?id=44215


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-03 Thread Fedor Sergeev via Phabricator via cfe-commits
fedor.sergeev added a comment.

Just FYI for now as I'm trying to dig futher...
I have been trying this fix in our downstream environment and managed to get a 
hang with this backtrace:

#0 needPadding (BoundarySize=, Size=, 
StartAddr=) at ./llvm/lib/MC/MCAssembler.cpp:1028
#1 llvm::MCAssembler::relaxMachineDependent (this=, Layout=..., 
MF=...) at ./llvm/lib/MC/MCAssembler.cpp:1077
#2 0x71adc8b6 in llvm::MCAssembler::layoutSectionOnce 
(this=this@entry=0x7fff90580580, Layout=..., Sec=...) at 
./orca/llvm/lib/MC/MCAssembler.cpp:1213
...

Working on getting upstream reproducer


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-12-02 Thread Philip Reames via Phabricator via cfe-commits
reames added a comment.

I want to chime in support of jyknight's meta comments - particularly the one 
about the need to balance execution speed vs code size differently in hot vs 
cold code.  For our use case, we have a very large amount of branch dense known 
cold paths, and being able to only align fast path branches would be a 
substantial space savings.

I also see value in having the prefix padding feature factored out generically. 
 If that mechanism is truly measurably faster than multi-byte nops - which if I 
reading comments correctly, has been claimed but not documented or measured? - 
using it generically for other alignment purposes would likely be worthwhile.

I'd also like to see - probably in a separate patch - support for 
auto-detecting whether the host CPU needs this mitigation.  Both -mcpu=native 
and various JITs will end up needing this, having the code centralized in one 
place would be good.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-26 Thread James Y Knight via Phabricator via cfe-commits
jyknight added a comment.

I have some other concerns about the code itself, but after pondering this a 
little bit, I'd like to instead bring up some rather more general concerns 
about the overall approach used here -- with some suggestions. (As these 
comments are not really comments on the _code_, but on the overall strategy, 
they also apply to the gnu binutils patch for this same feature.)

Of course, echoing chandler, it would be nice to see some performance numbers, 
otherwise it's not clear how useful any of this is.

Segment-prefix padding
--

The ability to pad instructions instead of adding a multibyte-NOP in order to 
create alignment seems like a generally-useful feature, which should be usable 
in other situations where we align within a function -- assuming that there is 
indeed a measurable performance benefit vs NOP instructions. E.g. we should do 
this for basic-block alignment, as well! As such, it feels like this feature 
ought to be split out, and implemented separately from the new branch-alignment 
functionality -- in a way which is usable for any kind of alignment request.

The way I'd imagine it working is to introduce a new pair of asm directives, to 
enable and disable segment-prefix padding in a certain range of instructions 
(let's say ".enable_instr_prefix_pad", ".disable_instr_prefix_pad". Within such 
an enabled range, the assembler would prefix instructions as required in order 
to handle later nominally-nop-emitting code alignment directives (such as the 
usual '.p2align 4, 0x90') .

Branch alignment


The primary goal of this patch, restricting the placement of branch 
instructions, is a performance optimization. Similar to loop alignment, the 
desire is to increase speed, at the cost of code-size. However, the way this 
feature has been designed is a global assembler flag. I find that not ideal, 
because it cannot take into account hotness of a block/function, as for example 
loop alignment code does. Basic-block alignment of loops is explicitly opt-in 
on an block-by-block basis -- the compiler simply emits a p2align directive 
where it needs, and the assembler honors that. And so, 
MachineBlockPlacement::alignBlocks has a bunch of conditions under which it 
will avoid emitting a p2align. This seems like a good model -- the assembler 
does what it's told by the compiler (or assembly-writer). Making the 
branch-instruction-alignment work similarly seems like it would be good.

IMO it would be nicest if there could be a directive that requests to 
specially-align the next instruction. However, the fused-jcc case makes that 
quite tricky, so perhaps this ought to also be a mode which can be 
enabled/disabled on a region as well.

Enabling by default
---

Previously, I'd mentioned that it seemed likely we couldn't actually enable 
branch-alignment by default because it'll probably break people's inline-asm 
and standalone asm files. That would be solved by making everything 
controllable within the asm file. The compiler could then insert the directives 
for its own code, and disable it around inline assembly. And standalone asm 
files could remain unaffected, unless they opt in. With that, we could actually 
enable the alignment by default, for compiled output in certain cpu-tuning 
modes, if it's warranted.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-24 Thread Chandler Carruth via Phabricator via cfe-commits
chandlerc added a comment.

(Just a reminder that we need to have both performance and code size numbers 
for this patch. And given that there are a few options, may need a few 
examples.)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-22 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1755927 , @jyknight wrote:

> An alternative would be to simply emit NOPs before branches as needed. That 
> would be substantially simpler, since it would only require special handling 
> for a branch or a fused-branch. I assume things were done this 
> substantially-more-complex way in order to reduce performance cost of 
> inserting NOP instructions? Are there numbers for how much better it is to 
> use segment prefixes, vs a separate nop instruction? It seems a little bit 
> surprising to me that it would be that important, but I don't know...
>
> I'll note that the method here has the semantic issue of making it 
> effectively impossible to ever evaluate an expression like ".if . - symbol == 
> 24" (assuming we're emitting instructions), since every instruction can now 
> change size. I suspect that will make it impossible to turn this on by 
> default without breaking a lot of assembly code. Previously, only certain 
> instructions, like branches or arithmetic ops with constant arguments of 
> unknown value, could change size.


Thanks for your remind! Now, if `-malign-branch-prefix-size=0` is used, the 
method will only insert `BranchPadding` or `FusedJccPadding` before branches as 
needed, and insert `BranchPrefix` before the instruction which is macro fusible 
but not macro fused. In this condition, the operation is as simple as inserting 
nop only. Since more performance can be recovered by inserting prefixes than 
inserting nops,  I believe I should support prefix padding.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-21 Thread Kan Shengchen via Phabricator via cfe-commits
skan marked an inline comment as done.
skan added a comment.

In D70157#1755927 , @jyknight wrote:

> Thanks for the comments, they help a little. But it's still somewhat 
> confusing, so let me write down what seems to be happening:
>
> - Before emitting every instruction, a new MCMachineDependentFragment is now 
> emitted, of one of the multiple types:
>   - For most instructions, that'll be BranchPrefix.
>   - For things that need branch-alignment, it'll be BranchPadding, unless 
> there's a fused conditional before, in which case it's BranchSplit
>   - For fused conditionals, it'll be FusedJccPadding.
> - After emitting an instruction that needs branch-alignment, all of those 
> previously-emitted MCMachineDependentFragment are updated to point to the 
> branch's fragment.
> - Thus, every MCDataFragment now only contains a single instruction (this 
> property is depended upon for getInstSize, at least).
>
>   All the MCMachineDependentFragments in a region bounded by a branch at the 
> end and either a branch or a fragment-type which is not type in {FT_Data, 
> FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, 
> will reference the ending branch instruction's fragment.
>
>   Then, when it comes time to do relaxation, every one of those 
> machine-dependent-fragments has the opportunity to grow its instruction a 
> little bit. The first instruction in a "block" will grow up to 5 segment 
> prefixes (via modifying the BranchPrefix fragment), and then if more is 
> needed, more prefixes will be added to the next instruction, and so on. Until 
> you run out of instructions in the region. At which point the BranchPadding 
> or FusedJccPadding types (right before the branch/fused-branch) will be able 
> to emit nops to achieve the desired alignment.
>
>   An alternative would be to simply emit NOPs before branches as needed. That 
> would be substantially simpler, since it would only require special handling 
> for a branch or a fused-branch. I assume things were done this 
> substantially-more-complex way in order to reduce performance cost of 
> inserting NOP instructions? Are there numbers for how much better it is to 
> use segment prefixes, vs a separate nop instruction? It seems a little bit 
> surprising to me that it would be that important, but I don't know...
>
>   I'll note that the method here has the semantic issue of making it 
> effectively impossible to ever evaluate an expression like ".if . - symbol == 
> 24" (assuming we're emitting instructions), since every instruction can now 
> change size. I suspect that will make it impossible to turn this on by 
> default without breaking a lot of assembly code. Previously, only certain 
> instructions, like branches or arithmetic ops with constant arguments of 
> unknown value, could change size.


Thanks for your detailed and accurate analysis for my code! I am sorry that 
this should have been done by me.




Comment at: llvm/lib/MC/MCAssembler.cpp:1058
+/// is the address of the symbol, which would casuse performance regression.
+void MCAssembler::moveSymbol(const MCFragment *Src, MCFragment *Dst) const {
+  if (!(Src && Dst && Dst->getKind() == MCFragment::FT_MachineDependent))

jyknight wrote:
> I don't think this is necessary. AFAICT, the symbols should already be in the 
> right place -- pointing to the relax fragment, not the instruction itself, 
> without this. And removing all this moveSymbol/updateSymbolMap code doesn't 
> make any tests fail.
Yes, I check it and you are right. I will removing all this 
moveSymbol/updateSymbolMap code.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-21 Thread Eli Friedman via Phabricator via cfe-commits
efriedma added inline comments.



Comment at: llvm/include/llvm/MC/MCFragment.h:684
+  // need to be aligned.
+  static unsigned AlignBoundarySize;
+  // AlignMaxPrefixSize - The maximum size of prefixes per instruction.

Global variables are forbidden in LLVM libraries; there could be multiple 
LLVMContexts in the same process.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-21 Thread Craig Topper via Phabricator via cfe-commits
craig.topper added a comment.

In D70157#1755927 , @jyknight wrote:

> Thanks for the comments, they help a little. But it's still somewhat 
> confusing, so let me write down what seems to be happening:
>
> - Before emitting every instruction, a new MCMachineDependentFragment is now 
> emitted, of one of the multiple types:
>   - For most instructions, that'll be BranchPrefix.
>   - For things that need branch-alignment, it'll be BranchPadding, unless 
> there's a fused conditional before, in which case it's BranchSplit
>   - For fused conditionals, it'll be FusedJccPadding.
> - After emitting an instruction that needs branch-alignment, all of those 
> previously-emitted MCMachineDependentFragment are updated to point to the 
> branch's fragment.
> - Thus, every MCDataFragment now only contains a single instruction (this 
> property is depended upon for getInstSize, at least).
>
>   All the MCMachineDependentFragments in a region bounded by a branch at the 
> end and either a branch or a fragment-type which is not type in {FT_Data, 
> FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, 
> will reference the ending branch instruction's fragment.
>
>   Then, when it comes time to do relaxation, every one of those 
> machine-dependent-fragments has the opportunity to grow its instruction a 
> little bit. The first instruction in a "block" will grow up to 5 segment 
> prefixes (via modifying the BranchPrefix fragment), and then if more is 
> needed, more prefixes will be added to the next instruction, and so on. Until 
> you run out of instructions in the region. At which point the BranchPadding 
> or FusedJccPadding types (right before the branch/fused-branch) will be able 
> to emit nops to achieve the desired alignment.
>
>   An alternative would be to simply emit NOPs before branches as needed. That 
> would be substantially simpler, since it would only require special handling 
> for a branch or a fused-branch. I assume things were done this 
> substantially-more-complex way in order to reduce performance cost of 
> inserting NOP instructions? Are there numbers for how much better it is to 
> use segment prefixes, vs a separate nop instruction? It seems a little bit 
> surprising to me that it would be that important, but I don't know...


I don't have any numbers myself. I was only involved in some of the code review 
internally. My understanding is that NOP instructions would place extra nop 
uops into the DSB(the decoded uop buffer) and that limits the performance that 
can be recovered. By using redundant prefixes no extra uops are generated and 
more performance is recovered.

> I'll note that the method here has the semantic issue of making it 
> effectively impossible to ever evaluate an expression like ".if . - symbol == 
> 24" (assuming we're emitting instructions), since every instruction can now 
> change size. I suspect that will make it impossible to turn this on by 
> default without breaking a lot of assembly code. Previously, only certain 
> instructions, like branches or arithmetic ops with constant arguments of 
> unknown value, could change size.




CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-21 Thread James Y Knight via Phabricator via cfe-commits
jyknight added a comment.

Thanks for the comments, they help a little. But it's still somewhat confusing, 
so let me write down what seems to be happening:

- Before emitting every instruction, a new MCMachineDependentFragment is now 
emitted, of one of the multiple types:
  - For most instructions, that'll be BranchPrefix.
  - For things that need branch-alignment, it'll be BranchPadding, unless 
there's a fused conditional before, in which case it's BranchSplit
  - For fused conditionals, it'll be FusedJccPadding.
- After emitting an instruction that needs branch-alignment, all of those 
previously-emitted MCMachineDependentFragment are updated to point to the 
branch's fragment.
- Thus, every MCDataFragment now only contains a single instruction (this 
property is depended upon for getInstSize, at least).

All the MCMachineDependentFragments in a region bounded by a branch at the end 
and either a branch or a fragment-type which is not type in {FT_Data, 
FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, 
will reference the ending branch instruction's fragment.

Then, when it comes time to do relaxation, every one of those 
machine-dependent-fragments has the opportunity to grow its instruction a 
little bit. The first instruction in a "block" will grow up to 5 segment 
prefixes (via modifying the BranchPrefix fragment), and then if more is needed, 
more prefixes will be added to the next instruction, and so on. Until you run 
out of instructions in the region. At which point the BranchPadding or 
FusedJccPadding types (right before the branch/fused-branch) will be able to 
emit nops to achieve the desired alignment.

An alternative would be to simply emit NOPs before branches as needed. That 
would be substantially simpler, since it would only require special handling 
for a branch or a fused-branch. I assume things were done this 
substantially-more-complex way in order to reduce performance cost of inserting 
NOP instructions? Are there numbers for how much better it is to use segment 
prefixes, vs a separate nop instruction? It seems a little bit surprising to me 
that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively 
impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming 
we're emitting instructions), since every instruction can now change size. I 
suspect that will make it impossible to turn this on by default without 
breaking a lot of assembly code. Previously, only certain instructions, like 
branches or arithmetic ops with constant arguments of unknown value, could 
change size.




Comment at: llvm/lib/MC/MCAssembler.cpp:1058
+/// is the address of the symbol, which would casuse performance regression.
+void MCAssembler::moveSymbol(const MCFragment *Src, MCFragment *Dst) const {
+  if (!(Src && Dst && Dst->getKind() == MCFragment::FT_MachineDependent))

I don't think this is necessary. AFAICT, the symbols should already be in the 
right place -- pointing to the relax fragment, not the instruction itself, 
without this. And removing all this moveSymbol/updateSymbolMap code doesn't 
make any tests fail.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread James Y Knight via Phabricator via cfe-commits
jyknight added a comment.

Overall comment: this whole change needs more comments, everywhere. Both for 
the added functions, and for the test cases. There is almost no description of 
what's happening, and it could really use it.




Comment at: llvm/lib/MC/MCAssembler.cpp:1041
+
+void MCAssembler::moveSymbol(const MCFragment *Src, MCFragment *Dst) const {
+  if (!(Src && Dst && Dst->getKind() == MCFragment::FT_MachineDependent))

I think this is broken -- moving symbols around like this risks causing 
evaluations of symbol offsets which may have already happened to be wrong. 

Also it's ugly, and I can't tell why it's necessary, because there's no 
comments.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:418
+
+bool X86AsmBackend::isReturn(const MCInst ) const {
+  unsigned Opcode = MI.getOpcode();

This set of functions down to isIndirectBranch() seems unnecessary. Pushing one 
line
  const MCInstrDesc  = MCII.get(Inst.getOpcode());
into needAlign(const MCInst ), and then just using InstDesc.isReturn() 
etc. would be fine.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:463
+  const X86::SecondMFInstKind BranchKind = classifySecond(Jcc, MCII);
+  return X86::isMacroFused(CmpKind,BranchKind);
+  llvm_unreachable("unknown fusion type");

please run something like "git clang-format HEAD~1" to re-format your patch.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:467
+
+char X86AsmBackend::choosePrefixValue(const MCInst ) const {
+  for (const auto  : MI) {

Comment on why this is doing what it's doing?



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:532
+
+bool X86AsmBackend::needAlignBranch() const {
+  return AlignBoundarySize != 0 &&

Confusing name, and doesn't need to be a separate function than needAlign(const 
MCAssembler &, MCSection *). Just merge it into that.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:537
+
+bool X86AsmBackend::needAlignJcc() const {
+  return AlignBoundarySize != 0 &&

These functions are only used in one place, and it doesn't make it more 
readable to split them off. Just merge them into needAlign(const MCInst ) 
or alignBranchesBegin as appropriate.

But also -- AlignBoundarySize shouldn't even need to be checked here, since 
it's already be checked in needAlign(const MCAssembler &, MCSection *), which 
is always called first.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:567
+
+bool X86AsmBackend::needAlign(const MCAssembler ,
+  MCSection *Sec) const {

Doesn't need to have the same name as the next needAlign, clearer if these two 
overloads are given different names.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:731-732
+char Prefix) const {
+  if (!(Prefix == 0x2e || Prefix == 0x36 || Prefix == 0x3e || Prefix == 0x26 ||
+Prefix == 0x64 || Prefix == 0x65))
+return false;

Why?



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h:106
+  // The classification for the first instruction in macro fusion.
+  enum class FirstMFInstKind {
+// TEST

"MF" is not a readable abbreviation here -- spell out "MacroFusion" instead. It 
was OK locally in the "X86MacroFusion.cpp" file, as the filename gave you a 
hint, but not here, where you have no such hint.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h:134
+  /// macro-fusion.
+  inline FirstMFInstKind classifyFirstOpcode(unsigned Opcode) {
+switch (Opcode) {

This function is also named poorly, after being moved to this more-generic 
location. "Classify" doesn't tell me anything about this being related to 
macro-fusion.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread Ed Maste via Phabricator via cfe-commits
emaste added a comment.

> I think the default policy discussion might be better had on llvm-dev than a 
> Phab review.

Ok, I agree with that, and also think it's useful to get this patch committed 
independent of a change to the defaults, which can be handled in a subsequent 
review.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread Craig Topper via Phabricator via cfe-commits
craig.topper added a comment.

In D70157#1747726 , @davezarzycki 
wrote:

> In D70157#1747640 , @jyknight wrote:
>
> > In D70157#1747551 , @davezarzycki 
> > wrote:
> >
> > > In D70157#1747510 , @xbolva00 
> > > wrote:
> > >
> > > > > Even though core2 isn't affected by the erratum, core2 code can run 
> > > > > on CPUs that do have the bug (and core2 is a popular target for code 
> > > > > that needs to run "everywhere"), therefore all target CPUs that 
> > > > > predate a hardware fix really
> > > >
> > > > So perf of all code/generic codegen will go down. This is not very 
> > > > acceptable. -1 for making this enabled by default for generic codegen.
> > >
> > >
> > > I'm okay with this not being the default if we document somewhere that 
> > > LLVM assumes that machines have up to date microcode installed.
> >
> >
> > So, my understanding is:
> >
> > 1. There is a CPU bug, that in very rare but unspecified circumstances, can 
> > cause branches of all kinds (conditional-jump, fused-conditional-jump, 
> > unconditional jump, call, return, indirect jumps and calls?) to go to the 
> > wrong place, if the instruction crosses a 64-byte cacheline boundary.
> > 2. In order to fix that bug, the new CPU firmware disables the faster 
> > decoded-icache mechanism, falling back to the legacy decoder for any 
> > 32-byte block of code which ends with one of those jumps (because it works 
> > in 32-byte blocks, not 64-byte blocks). Falling back to the legacy decoder 
> > can have a sometimes-significant performance cost. Of course, this is not 
> > the _only_ reason that the CPU falls back to the legacy decoder, but it's a 
> > significant new one.
> >
> >   This patch, then, spends some memory and icache space (and potentially 
> > decode-time) in order to try to get back the performance lost by the above 
> > fix. This can be worthwhile because if you can cause there not to be any 
> > 32-byte code sections that have a branch at the end, then the fallback to 
> > the legacy decoder won't trigger, and the performance will hopefully return 
> > to something comparable to the initial state. Unfortunately, while this 
> > ought to increase performance impact on all of the affected processors, it 
> > will have only negative effects on unaffected processors, since then you're 
> > spending memory and icache space, and not getting anything back in return.
> >
> >   Similarly, if you're doing this extra alignment on code that isn't 
> > executed repeatedly out of the icache, it's also going to be useless.
> >
> >   So, I'd say it's _not at all_ clear that it should be enabled by default.
> >
> >   I'm also a bit confused as to why the default is set as it is. Why is the 
> > default only fused+jcc+jmp, not all branch types?
>
>
> Strictly speaking, Intel doesn't say what happens if this bug occurs, nor do 
> they say that branches themselves are the problem, just "branch instruction 
> bytes" and that "unpredictable system behavior may occur". Compare and 
> contrast with the subsequent erratum where they explicitly say that x86, AVX, 
> and integer divide instructions (not instruction bytes) can have 
> unpredictable behavior. See SKX102 and SKX103: 
> https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
>
> But we're digressing. I think the default policy discussion might be better 
> had on llvm-dev than a Phab review.


I believe the unpredictable behavior erratum only affects conditional jumps. 
But the microcode update to workaround the erratum used a bigger hammer that 
effects all kinds of jumps, calls, and returns.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread David Zarzycki via Phabricator via cfe-commits
davezarzycki added a comment.

In D70157#1747640 , @jyknight wrote:

> In D70157#1747551 , @davezarzycki 
> wrote:
>
> > In D70157#1747510 , @xbolva00 
> > wrote:
> >
> > > > Even though core2 isn't affected by the erratum, core2 code can run on 
> > > > CPUs that do have the bug (and core2 is a popular target for code that 
> > > > needs to run "everywhere"), therefore all target CPUs that predate a 
> > > > hardware fix really
> > >
> > > So perf of all code/generic codegen will go down. This is not very 
> > > acceptable. -1 for making this enabled by default for generic codegen.
> >
> >
> > I'm okay with this not being the default if we document somewhere that LLVM 
> > assumes that machines have up to date microcode installed.
>
>
> So, my understanding is:
>
> 1. There is a CPU bug, that in very rare but unspecified circumstances, can 
> cause branches of all kinds (conditional-jump, fused-conditional-jump, 
> unconditional jump, call, return, indirect jumps and calls?) to go to the 
> wrong place, if the instruction crosses a 64-byte cacheline boundary.
> 2. In order to fix that bug, the new CPU firmware disables the faster 
> decoded-icache mechanism, falling back to the legacy decoder for any 32-byte 
> block of code which ends with one of those jumps (because it works in 32-byte 
> blocks, not 64-byte blocks). Falling back to the legacy decoder can have a 
> sometimes-significant performance cost. Of course, this is not the _only_ 
> reason that the CPU falls back to the legacy decoder, but it's a significant 
> new one.
>
>   This patch, then, spends some memory and icache space (and potentially 
> decode-time) in order to try to get back the performance lost by the above 
> fix. This can be worthwhile because if you can cause there not to be any 
> 32-byte code sections that have a branch at the end, then the fallback to the 
> legacy decoder won't trigger, and the performance will hopefully return to 
> something comparable to the initial state. Unfortunately, while this ought to 
> increase performance impact on all of the affected processors, it will have 
> only negative effects on unaffected processors, since then you're spending 
> memory and icache space, and not getting anything back in return.
>
>   Similarly, if you're doing this extra alignment on code that isn't executed 
> repeatedly out of the icache, it's also going to be useless.
>
>   So, I'd say it's _not at all_ clear that it should be enabled by default.
>
>   I'm also a bit confused as to why the default is set as it is. Why is the 
> default only fused+jcc+jmp, not all branch types?


Strictly speaking, Intel doesn't say what happens if this bug occurs, nor do 
they say that branches themselves are the problem, just "branch instruction 
bytes" and that "unpredictable system behavior may occur". Compare and contrast 
with the subsequent erratum where they explicitly say that x86, AVX, and 
integer divide instructions (not instruction bytes) can have unpredictable 
behavior. See SKX102 and SKX103: 
https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

But we're digressing. I think the default policy discussion might be better had 
on llvm-dev than a Phab review.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread James Y Knight via Phabricator via cfe-commits
jyknight added a comment.

In D70157#1747551 , @davezarzycki 
wrote:

> In D70157#1747510 , @xbolva00 wrote:
>
> > > Even though core2 isn't affected by the erratum, core2 code can run on 
> > > CPUs that do have the bug (and core2 is a popular target for code that 
> > > needs to run "everywhere"), therefore all target CPUs that predate a 
> > > hardware fix really
> >
> > So perf of all code/generic codegen will go down. This is not very 
> > acceptable. -1 for making this enabled by default for generic codegen.
>
>
> I'm okay with this not being the default if we document somewhere that LLVM 
> assumes that machines have up to date microcode installed.


So, my understanding is:

1. There is a CPU bug, that in very rare but unspecified circumstances, can 
cause branches of all kinds (conditional-jump, fused-conditional-jump, 
unconditional jump, call, return, indirect jumps and calls?) to go to the wrong 
place, if the instruction crosses a 64-byte cacheline boundary.
2. In order to fix that bug, the new CPU firmware disables the faster 
decoded-icache mechanism, falling back to the legacy decoder for any 32-byte 
block of code which ends with one of those jumps (because it works in 32-byte 
blocks, not 64-byte blocks). Falling back to the legacy decoder can have a 
sometimes-significant performance cost. Of course, this is not the _only_ 
reason that the CPU falls back to the legacy decoder, but it's a significant 
new one.

This patch, then, spends some memory and icache space (and potentially 
decode-time) in order to try to get back the performance lost by the above fix. 
This can be worthwhile because if you can cause there not to be any 32-byte 
code sections that have a branch at the end, then the fallback to the legacy 
decoder won't trigger, and the performance will hopefully return to something 
comparable to the initial state. Unfortunately, while this ought to increase 
performance impact on all of the affected processors, it will have only 
negative effects on unaffected processors, since then you're spending memory 
and icache space, and not getting anything back in return.

Similarly, if you're doing this extra alignment on code that isn't executed 
repeatedly out of the icache, it's also going to be useless.

So, I'd say it's _not at all_ clear that it should be enabled by default.

I'm also a bit confused as to why the default is set as it is. Why is the 
default only fused+jcc+jmp, not all branch types?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread David Zarzycki via Phabricator via cfe-commits
davezarzycki added a comment.

In D70157#1746793 , @MaskRay wrote:

> On x86, the preferred function alignment is 16 
> (https://github.com/llvm/llvm-project/blob/arcpatch-D70157/llvm/lib/Target/X86/X86ISelLowering.cpp#L1893),
>  which is the default function alignment in text sections. If the 
> cross-boundary decision is made with alignment=32 
> (--x86-align-branch-boundary=32) in mind, and the section alignment is still 
> 16 (not increased to 32 or higher), the linker may place the section at an 
> address which equals 16 modulo 32, the section contents will thus shift by 
> 16. The instructions that do not cross the boundary in the object files may 
> cross the boundary in the linker output. Have you considered increasing the 
> section alignment to 32?
>
> Shall we default to -mbranches-within-32B-boundaries if the specified -march= 
> or -mtune= may be affected by the erratum?


That isn't good enough. Even though core2 isn't affected by the erratum, core2 
code can run on CPUs that do have the bug (and core2 is a popular target for 
code that needs to run "everywhere"), therefore all target CPUs that predate a 
hardware fix really ought to have -mbranches-within-32B-boundaries by default.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1746793 , @MaskRay wrote:

> On x86, the preferred function alignment is 16 
> (https://github.com/llvm/llvm-project/blob/arcpatch-D70157/llvm/lib/Target/X86/X86ISelLowering.cpp#L1893),
>  which is the default function alignment in text sections. If the 
> cross-boundary decision is made with alignment=32 
> (--x86-align-branch-boundary=32) in mind, and the section alignment is still 
> 16 (not increased to 32 or higher), the linker may place the section at an 
> address which equals 16 modulo 32, the section contents will thus shift by 
> 16. The instructions that do not cross the boundary in the object files may 
> cross the boundary in the linker output. Have you considered increasing the 
> section alignment to 32?
>
> Shall we default to -mbranches-within-32B-boundaries if the specified -march= 
> or -mtune= may be affected by the erratum?


Yes, I have considered increasing the section alignment to 32 if any branch 
need to be aligned in this section. You can find related code at the end of 
fuction `alignBranchesEnd`.

I  think the users should have the right not to align branches even if their 
arch may be affected by the erratum, so currently I don't do that.




Comment at: llvm/lib/MC/MCFragment.cpp:435
+OS << "\n   ";
+OS << "Subtype:" << MF->SubKind << "Size: " << MF->getSize();
+break;

MaskRay wrote:
> ```
> error: use of overloaded operator '<<' is ambiguous (with opera
> nd types 'llvm::raw_ostream' and 'llvm::MCMachineDependentFragment::SubType')
> ```
> 
> It seems you haven't defined `raw_ostream <<(raw_ostream , 
> MCMachineDependentFragment::SubType)`. I believe you also missed a space 
> before `Size: `.
`llvm::MCMachineDependentFragment::SubType`  is an enum type, it seems okay for 
me to not define `raw_ostream <<(raw_ostream , 
MCMachineDependentFragment::SubType)`  and I could not reproduce the error.  
However, it's not clear to print a enum in the dump information. I change the 
code to print a string.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:88
+SmallVector BranchTypes;
+StringRef(Val).split(BranchTypes, '-', -1, false);
+for (auto BranchType : BranchTypes) {

MaskRay wrote:
> skan wrote:
> > craig.topper wrote:
> > > skan wrote:
> > > > craig.topper wrote:
> > > > > skan wrote:
> > > > > > chandlerc wrote:
> > > > > > > I feel like a comma-separated list would be a bit more clear...
> > > > > > We can't use comma-separated list because we need pass the option 
> > > > > > with flto.  
> > > > > > "-Wl,-plugin-opt=--x86-align-branch-boundary=32,--plugin-opt=-x86-align-branch=fused,jcc,jmp,--plugin-opt=-x86-align-branch-prefix-size=5"
> > > > > >  would cause a compile fail because "jcc" was recognized as another 
> > > > > > option rather than a part of option 
> > > > > > "-x86-align-branch=fused,jcc,jmp" 
> > > > > Isn't there some way to nest quotes into the part of after 
> > > > > -plugin-opt= ?
> > > > I tried to use --plugin-opt=-x86-align-branch="fused,jcc,jmp", it 
> > > > didn't work.
> > > What if you put —plugin-opt=“-x86-align-branch=fused,jcc,jmp”
> > It doesn't work too.
> Both gcc and clang just split the -Wl argument by comma. There is no escape 
> character. For reference, 
> https://sourceware.org/ml/binutils/2019-11/msg00173.html is the GNU as patch 
> on the binutils side.
> 
> Have you considered `+` plus? I think it may be more intuitive.
Yes, I agree with you. Done



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:173
+
+  bool needAlignBranch() const;
+  bool needAlignJcc() const;

MaskRay wrote:
> We can generalize these functions with a function that takes a parameter.
We already have a generalized function `bool needAlign(const MCInst ) 
const`.`needAlignJcc()` is only a helper function that makes code more 
readable.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:411
+  cast(*Operand.getExpr()).getSymbol().getName();
+  if (SymbolName.contains("__tls_get_addr"))
+return true;

MaskRay wrote:
> Check 64bit, then use exact comparison with either `___tls_get_addr` or 
> `__tls_get_addr`
Why? There exists TLS Call in 32bit mode. Does  
`SymbolName.contains("__tls_get_addr")` possibly include more calls rather TLS 
Call?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread Ed Maste via Phabricator via cfe-commits
emaste added a comment.

> Shall we default to -mbranches-within-32B-boundaries if the specified -march= 
> or -mtune= may be affected by the erratum?

I think we should enable it based on `-mtune` specifying an affected CPU (and 
implicitly based on `-march` if `-mtune` is not specified).


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread David Zarzycki via Phabricator via cfe-commits
davezarzycki added a comment.

In D70157#1747510 , @xbolva00 wrote:

> > Even though core2 isn't affected by the erratum, core2 code can run on CPUs 
> > that do have the bug (and core2 is a popular target for code that needs to 
> > run "everywhere"), therefore all target CPUs that predate a hardware fix 
> > really
>
> So perf of all code/generic codegen will go down. This is not very 
> acceptable. -1 for making this enabled by default for generic codegen.


I'm okay with this not being the default if we document somewhere that LLVM 
assumes that machines have up to date microcode installed.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread Kan Shengchen via Phabricator via cfe-commits
skan marked 3 inline comments as done.
skan added inline comments.



Comment at: llvm/test/MC/X86/i386-align-branch-1a.s:1
+# RUN: llvm-mc -filetype=obj -triple i386-unknown-unknown 
--x86-align-branch-boundary=32 --x86-align-branch=fused-jcc-jmp  
--x86-align-branch-prefix-size=5 %s | llvm-objdump -d  - | FileCheck %s
+# RUN: llvm-mc -filetype=obj -triple i386-unknown-unknown 
--x86-branches-within-32B-boundaries %s | llvm-objdump -d  - | FileCheck %s

MaskRay wrote:
> MaskRay wrote:
> > If you want to test that `--x86-branches-within-32B-boundaries` expands to 
> > the 3 other `--x86-*` options.
> > 
> > ```
> > ... -o %t
> > ... -o %t2
> > cmp %t %t2
> > FileCheck --input-file %t
> > ```
> We need a file-level comment describing the purpose of the test.
> 
> What do `1a`, `2a`, and `5a` mean? If we can find appropriate, descriptive 
> names, use them. Don't simply copy the binutils tests. If tests exist for 
> each of the used instruction prefix, write a comment to make that clear.
i will do it later



Comment at: llvm/test/MC/X86/i386-align-branch-1a.s:9
+# CHECK:  foo:
+# CHECK:0: 65 65 65 a3 01 00 00 00 movl%eax, %gs:1
+# CHECK:8: 55  pushl   %ebp

MaskRay wrote:
> MaskRay wrote:
> > Use `CHECK-NEXT:` (see https://llvm.org/docs/CommandGuide/FileCheck.html)
> > 
> > Add comments where REX prefixes or NOPs are used.
> Don't mix tabs with spaces. The columns do not align in an editor.
Done



Comment at: llvm/test/MC/X86/i386-align-branch-1a.s:9
+# CHECK:  foo:
+# CHECK:0: 65 65 65 a3 01 00 00 00 movl%eax, %gs:1
+# CHECK:8: 55  pushl   %ebp

skan wrote:
> MaskRay wrote:
> > MaskRay wrote:
> > > Use `CHECK-NEXT:` (see https://llvm.org/docs/CommandGuide/FileCheck.html)
> > > 
> > > Add comments where REX prefixes or NOPs are used.
> > Don't mix tabs with spaces. The columns do not align in an editor.
> Done
I will do it later


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread Dávid Bolvanský via Phabricator via cfe-commits
xbolva00 added a comment.

> Even though core2 isn't affected by the erratum, core2 code can run on CPUs 
> that do have the bug (and core2 is a popular target for code that needs to 
> run "everywhere"), therefore all target CPUs that predate a hardware fix 
> really

So perf of all code/generic codegen will go down. This is not very acceptable. 
-1 for making this enabled by default for generic codegen.

Also, you will pessimize codegen for no real reason for AMD.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-16 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1747428 , @davezarzycki 
wrote:

> In D70157#1746793 , @MaskRay wrote:
>
> > On x86, the preferred function alignment is 16 
> > (https://github.com/llvm/llvm-project/blob/arcpatch-D70157/llvm/lib/Target/X86/X86ISelLowering.cpp#L1893),
> >  which is the default function alignment in text sections. If the 
> > cross-boundary decision is made with alignment=32 
> > (--x86-align-branch-boundary=32) in mind, and the section alignment is 
> > still 16 (not increased to 32 or higher), the linker may place the section 
> > at an address which equals 16 modulo 32, the section contents will thus 
> > shift by 16. The instructions that do not cross the boundary in the object 
> > files may cross the boundary in the linker output. Have you considered 
> > increasing the section alignment to 32?
> >
> > Shall we default to -mbranches-within-32B-boundaries if the specified 
> > -march= or -mtune= may be affected by the erratum?
>
>
> That isn't good enough. Even though core2 isn't affected by the erratum, 
> core2 code can run on CPUs that do have the bug (and core2 is a popular 
> target for code that needs to run "everywhere"), therefore all target CPUs 
> that predate a hardware fix really


ought to have -mbranches-within-32B-boundaries by default.

Make sense, I will enable it later.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-14 Thread annita.zhang via Phabricator via cfe-commits
annita.zhang added a comment.

In D70157#1744243 , @chandlerc wrote:

> Thanks for sending this out!
>
> I've made a minor comment on the flag structure, but my bigger question would 
> be: can you summarize the performance hit and code size hit you've seen 
> across some benchmarks? SPEC and the LLVM test suite would be nice, and maybe 
> code size alone for some large binary like Chrome, Firefox, or Safari?
>
> I'd be particularly interested in comparing the performance & code size hit 
> incurred by the suggested "fused,jcc,jmp" set compared to all (adding call, 
> ret, and indirect). If there is a big drop in either, I'd love to know which 
> of these incurs the large drop.
>
> Thanks,
> -Chandler


We are working on the performance and code size evaluation. Will update once 
it's available.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-14 Thread Xiang Zhang via Phabricator via cfe-commits
xiangzhangllvm added a comment.

In D70157#1746793 , @MaskRay wrote:

> On x86, the preferred function alignment is 16 
> (https://github.com/llvm/llvm-project/blob/arcpatch-D70157/llvm/lib/Target/X86/X86ISelLowering.cpp#L1893),
>  which is the default function alignment in text sections. If the 
> cross-boundary decision is made with alignment=32 
> (--x86-align-branch-boundary=32) in mind, and the section alignment is still 
> 16 (not increased to 32 or higher), the linker may place the section at an 
> address which equals 16 modulo 32, the section contents will thus shift by 
> 16. The instructions that do not cross the boundary in the object files may 
> cross the boundary in the linker output. Have you considered increasing the 
> section alignment to 32?
>
> Shall we default to -mbranches-within-32B-boundaries if the specified -march= 
> or -mtune= may be affected by the erratum?


Hi Fangrui, Here will set the section alignment to 32, 
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:658


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-14 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added inline comments.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:173
+
+  bool needAlignBranch() const;
+  bool needAlignJcc() const;

We can generalize these functions with a function that takes a parameter.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:411
+  cast(*Operand.getExpr()).getSymbol().getName();
+  if (SymbolName.contains("__tls_get_addr"))
+return true;

Check 64bit, then use exact comparison with either `___tls_get_addr` or 
`__tls_get_addr`



Comment at: llvm/test/MC/X86/i386-align-branch-1a.s:1
+# RUN: llvm-mc -filetype=obj -triple i386-unknown-unknown 
--x86-align-branch-boundary=32 --x86-align-branch=fused-jcc-jmp  
--x86-align-branch-prefix-size=5 %s | llvm-objdump -d  - | FileCheck %s
+# RUN: llvm-mc -filetype=obj -triple i386-unknown-unknown 
--x86-branches-within-32B-boundaries %s | llvm-objdump -d  - | FileCheck %s

MaskRay wrote:
> If you want to test that `--x86-branches-within-32B-boundaries` expands to 
> the 3 other `--x86-*` options.
> 
> ```
> ... -o %t
> ... -o %t2
> cmp %t %t2
> FileCheck --input-file %t
> ```
We need a file-level comment describing the purpose of the test.

What do `1a`, `2a`, and `5a` mean? If we can find appropriate, descriptive 
names, use them. Don't simply copy the binutils tests. If tests exist for each 
of the used instruction prefix, write a comment to make that clear.



Comment at: llvm/test/MC/X86/i386-align-branch-6a.s:10
+# CHECK:2: 89 e5   movl%esp, %ebp
+# CHECK:4: 90  nop
+# CHECK:5: 90  nop

Use something like `# CHECK-COUNT-20:90  nop` (address is intentionally 
omitted) to skip a large number of NOPs.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-14 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added a comment.

On x86, the preferred function alignment is 16 
(https://github.com/llvm/llvm-project/blob/arcpatch-D70157/llvm/lib/Target/X86/X86ISelLowering.cpp#L1893),
 which is the default function alignment in text sections. If the 
cross-boundary decision is made with alignment=32 
(--x86-align-branch-boundary=32) in mind, and the section alignment is still 16 
(not increased to 32 or higher), the linker may place the section at an address 
which equals 16 modulo 32, the section contents will thus shift by 16. The 
instructions that do not cross the boundary in the object files may cross the 
boundary in the linker output. Have you considered increasing the section 
alignment to 32?

Shall we default to -mbranches-within-32B-boundaries if the specified -march= 
or -mtune= may be affected by the erratum?




Comment at: llvm/test/MC/X86/i386-align-branch-1a.s:1
+# RUN: llvm-mc -filetype=obj -triple i386-unknown-unknown 
--x86-align-branch-boundary=32 --x86-align-branch=fused-jcc-jmp  
--x86-align-branch-prefix-size=5 %s | llvm-objdump -d  - | FileCheck %s
+# RUN: llvm-mc -filetype=obj -triple i386-unknown-unknown 
--x86-branches-within-32B-boundaries %s | llvm-objdump -d  - | FileCheck %s

If you want to test that `--x86-branches-within-32B-boundaries` expands to the 
3 other `--x86-*` options.

```
... -o %t
... -o %t2
cmp %t %t2
FileCheck --input-file %t
```



Comment at: llvm/test/MC/X86/i386-align-branch-1a.s:9
+# CHECK:  foo:
+# CHECK:0: 65 65 65 a3 01 00 00 00 movl%eax, %gs:1
+# CHECK:8: 55  pushl   %ebp

Use `CHECK-NEXT:` (see https://llvm.org/docs/CommandGuide/FileCheck.html)

Add comments where REX prefixes or NOPs are used.



Comment at: llvm/test/MC/X86/i386-align-branch-1a.s:9
+# CHECK:  foo:
+# CHECK:0: 65 65 65 a3 01 00 00 00 movl%eax, %gs:1
+# CHECK:8: 55  pushl   %ebp

MaskRay wrote:
> Use `CHECK-NEXT:` (see https://llvm.org/docs/CommandGuide/FileCheck.html)
> 
> Add comments where REX prefixes or NOPs are used.
Don't mix tabs with spaces. The columns do not align in an editor.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-14 Thread Fangrui Song via Phabricator via cfe-commits
MaskRay added inline comments.



Comment at: llvm/lib/MC/MCFragment.cpp:435
+OS << "\n   ";
+OS << "Subtype:" << MF->SubKind << "Size: " << MF->getSize();
+break;

```
error: use of overloaded operator '<<' is ambiguous (with opera
nd types 'llvm::raw_ostream' and 'llvm::MCMachineDependentFragment::SubType')
```

It seems you haven't defined `raw_ostream <<(raw_ostream , 
MCMachineDependentFragment::SubType)`. I believe you also missed a space before 
`Size: `.



Comment at: llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp:88
+SmallVector BranchTypes;
+StringRef(Val).split(BranchTypes, '-', -1, false);
+for (auto BranchType : BranchTypes) {

skan wrote:
> craig.topper wrote:
> > skan wrote:
> > > craig.topper wrote:
> > > > skan wrote:
> > > > > chandlerc wrote:
> > > > > > I feel like a comma-separated list would be a bit more clear...
> > > > > We can't use comma-separated list because we need pass the option 
> > > > > with flto.  
> > > > > "-Wl,-plugin-opt=--x86-align-branch-boundary=32,--plugin-opt=-x86-align-branch=fused,jcc,jmp,--plugin-opt=-x86-align-branch-prefix-size=5"
> > > > >  would cause a compile fail because "jcc" was recognized as another 
> > > > > option rather than a part of option "-x86-align-branch=fused,jcc,jmp" 
> > > > Isn't there some way to nest quotes into the part of after -plugin-opt= 
> > > > ?
> > > I tried to use --plugin-opt=-x86-align-branch="fused,jcc,jmp", it didn't 
> > > work.
> > What if you put —plugin-opt=“-x86-align-branch=fused,jcc,jmp”
> It doesn't work too.
Both gcc and clang just split the -Wl argument by comma. There is no escape 
character. For reference, 
https://sourceware.org/ml/binutils/2019-11/msg00173.html is the GNU as patch on 
the binutils side.

Have you considered `+` plus? I think it may be more intuitive.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-14 Thread Tom Stellard via Phabricator via cfe-commits
tstellar added a comment.

Are there any optimizations in lld that might undo the 32-byte alignment 
emitted by the compiler?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D70157: Align branches within 32-Byte boundary

2019-11-13 Thread Kan Shengchen via Phabricator via cfe-commits
skan added a comment.

In D70157#1745128 , @craig.topper 
wrote:

> In D70157#1745127 , @skan wrote:
>
> > In D70157#1745125 , @craig.topper 
> > wrote:
> >
> > > I've definitely passed something like 
> > > -Wl,--plugin-opt=-debug-only=isel,dagcombine on the command line before.
> >
> >
> > I don't know how you could pass it succcessfully.  It doesn't work for me 
> > either.  Here is the complain:
> >
> >   clang --gcc-toolchain=xx -fuse-ld=gold   -march=skylake-avx512 
> > -mfpmath=sse -Ofast -funroll-loops -flto   
> > -Wl,-plugin-opt=-debug-only=isel,dagcombine xxx.o main.o
> >  -lm -o xxx
> >   /usr/bin/ld.gold: error: cannot open dagcombine: No such file or directory
> >
>
>
> Does it work as -Wl,-plugin-opt="-debug-only=isel,dagcombine" ?


No


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


  1   2   >