A store-forwarding stall usually costs much more than a load that hits the L1 cache. On Haswell, for instance, an L1 load hit takes ~4 cycles, while a store-forwarding stall costs about 10 cycles more than a successful store forward, i.e. roughly 15 cycles in total. In some scenarios a store-forwarding stall can cost as much as 50 cycles; see Agner Fog's microarchitecture documentation.
In other words, the optimizer needs to be taught to avoid defeating this hardware pipeline feature as much as possible.
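For illustration, here is a minimal sketch (the buffer and offsets are invented for this example, not taken from the patch) contrasting a load that the store buffer can forward to with one that it cannot:

// Store-to-load forwarding sketch: a same-width reload of freshly stored
// bytes can be serviced from the store buffer, while a wider load that only
// partially overlaps the pending store typically cannot, and stalls.
#include <cstring>

alignas(8) unsigned char buf[8];

unsigned forwards_ok() {
  unsigned v = 42;
  std::memcpy(buf, &v, 4);       // 4-byte store
  unsigned r;
  std::memcpy(&r, buf, 4);       // 4-byte load of the same bytes: forwarded
  return r;
}

unsigned long long forwarding_stall() {
  buf[2] = 42;                   // 1-byte store
  unsigned long long r;
  std::memcpy(&r, buf, 8);       // 8-byte load covering it: the narrow store
  return r;                      // cannot be forwarded, so the load waits
}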
David
On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfin...@anl.gov> wrote:
On 09/03/2017 03:44 PM, Wei Mi wrote:
On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfin...@anl.gov> wrote:
On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davi...@google.com> wrote:
On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <revi...@reviews.llvm.org> wrote:
chandlerc added a comment.
I'm really not a fan of the degree of complexity and subtlety that this introduces into the frontend, all to enable particular backend optimizations.

I feel like this is Clang working around a fundamental deficiency in LLVM, and we should instead find a way to fix this in LLVM itself. As has been pointed out before, user code can synthesize large integers from which small bit sequences are extracted, and Clang and LLVM should handle those just as well as actual bitfields.

Can we see how far we can push the LLVM side before we add complexity to Clang here? I understand that there remain challenges on the LLVM side, but I don't think those challenges take *all* of the LLVM improvements off the table. I don't think we've exhausted all ways of improving the LLVM changes being proposed, and I think we should still land all of those and re-evaluate how important these issues are once all of that is in place.
The main challenge of doing this in LLVM is that inter-procedural (and possibly cross-module) analysis is needed for the store-forwarding issues.

Wei, perhaps you can provide a concrete test case to illustrate the issue so that reviewers have a good understanding.
David
Here is a runnable test case:
-------------------- 1.cc ------------------------
class A {
public:
  unsigned long f1:2;
  unsigned long f2:6;
  unsigned long f3:8;
  unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 1000000000;

__attribute__((noinline))
void foo() {
  a.f3 = 3;
}

__attribute__((noinline))
void goo() {
  b = a.f3;
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
    foo();
    goo();
  }
}
------------------------------------------------------------
With this test case, trunk takes about twice the running time of trunk + this patch. That is because trunk shrinks the store of a.f3 in foo (done by the DAGCombiner) but does not shrink the load of a.f3 in goo, so store forwarding is blocked.
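To make the blocked forwarding concrete, here is a rough hand-written sketch (not compiler output; it assumes the usual little-endian layout, which puts f3 at byte offset 1 of the 8-byte allocation unit) of the memory accesses foo and goo effectively perform once only the store has been shrunk:

// Effective memory accesses after store shrinking in foo with no matching
// load shrinking in goo.  'a_word' stands in for the 8-byte storage of 'a';
// f3 occupies bits [8,16), i.e. byte 1, under the assumed layout.
#include <cstring>

alignas(8) unsigned char a_word[8];
unsigned long b;

void foo_shrunk() {
  a_word[1] = 3;                 // shrunk: 1-byte store of f3 only
}

void goo_not_shrunk() {
  unsigned long w;
  std::memcpy(&w, a_word, 8);    // still an 8-byte load of the whole word;
  b = (w >> 8) & 0xff;           // it overlaps the pending 1-byte store,
}                                // so forwarding fails and the load stalls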
I can confirm that this is true on Haswell and also on a POWER8. Interestingly, on a POWER7 the performance is the same either way (on the good side). I ran the test case both as presented and with f3 replaced by a non-bitfield unsigned char member. Thinking that the POWER7 result might be due to its being big-endian, I flipped the order of the fields and found that the version where f3 is not a bitfield is faster, but only by 12.5%.

Why, in this case, don't we shrink the load? It seems like we should (and it seems like a straightforward case).
Thanks again,
Hal
Hal, thanks for trying the test.

Yes, it is straightforward to shrink the load in this test. I can change the test case a little to show why it is sometimes difficult to shrink the load:
class A {
public:
  unsigned long f1:16;
  unsigned long f2:16;
  unsigned long f3:16;
  unsigned long f4:8;
};
A a;
bool b;
unsigned long N = 1000000000;

__attribute__((noinline))
void foo() {
  a.f4 = 3;
}

__attribute__((noinline))
void goo() {
  b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
    foo();
    goo();
  }
}
For the load of a.f4 in goo, it is difficult to justify shrinking it after instcombine, because the comparison with a.f3 and the comparison with a.f4 have been merged into a single wide load, mask, and compare:
define void @_Z3goov() local_unnamed_addr #0 {
  %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
  %2 = and i64 %1, 72057589742960640      ; 0xFFFFFF00000000: mask covering f3 and f4
  %3 = icmp eq i64 %2, 281470681743360    ; 0xFFFF00000000: f4 == 0 && f3 == 0xFFFF
  %4 = zext i1 %3 to i8
  store i8 %4, i8* @b, align 1, !tbaa !2
  ret void
}
Exactly. But this behavior is desirable, locally. There's no good answer here: we either generate extra load traffic here (because we need to load the fields separately), or we widen the store (which generates extra load traffic there). Do you know, in terms of performance, which is better in this case (i.e., is it better to widen the store or split the load)?
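For reference, here is a rough hand-written sketch of the two alternatives (offsets assume the usual little-endian layout, so f3 occupies bytes 4-5 and f4 byte 6 of the 8-byte allocation unit; the function names are invented for illustration):

// The two ways to keep the store and the load compatible for forwarding.
#include <cstring>

alignas(8) unsigned char word[8];   // stands in for the storage of 'a'
bool b;

// Alternative 1: widen the store.  foo does a read-modify-write of the whole
// word (extra load traffic in foo), and its 8-byte store forwards cleanly to
// goo's existing 8-byte load.
void foo_widened_store() {
  unsigned long long w;
  std::memcpy(&w, word, 8);
  w = (w & ~(0xffULL << 48)) | (3ULL << 48);  // merge f4 = 3
  std::memcpy(word, &w, 8);
}

// Alternative 2: split the load.  goo loads f3 and f4 separately (extra load
// traffic in goo); the 1-byte load of f4 matches the shrunk 1-byte store and
// forwards, and the 2-byte load of f3 does not overlap the store at all.
void goo_split_loads() {
  unsigned short f3;
  unsigned char f4;
  std::memcpy(&f3, &word[4], 2);
  std::memcpy(&f4, &word[6], 1);
  b = (f4 == 0 && f3 == 0xffff);
}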
-Hal
Thanks,
Wei.
The test cases show the potential problem of store shrinking: before we decide to shrink a store, we need to know that all of the related loads will be shrunk as well, and that requires IPA analysis. Otherwise, when load shrinking is blocked for some difficult case (like the instcombine case described in https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html), a performance regression will occur.

Wei.
Repository:
rL LLVM
https://reviews.llvm.org/D36562
_______________________________________________
llvm-commits mailing list
llvm-comm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory