On 09/03/2017 11:06 PM, Xinliang David Li wrote:
I think we can look at this another way.

For modern CPU architectures that support store forwarding with store queues, it is generally not "safe" to blindly do local optimizations that widen the loads/stores

Why not widen stores? Generally, the problem with store forwarding arises when the load is wider than the store (plus alignment issues).

without sophisticated inter-procedural analysis. Doing so runs the risk of greatly reducing the performance of some programs. Keeping a naturally aligned load/store at its natural type is safer.

Does it make sense?

It makes sense. I could, however, say the same thing about inlining. We need to make inlining decisions locally, but they have global impact. Nevertheless, we need to make inlining decisions, and there's no practical way to make them in a truly non-local way.

We also don't pessimize common cases to improve performance in rare cases. In the common case, reducing pressure on the memory units and reducing the critical path seem likely to me to be optimal. If that's not true, or if doing otherwise has negligible cost (but can prevent rare downsides), we should certainly consider those options.

And none of this answers the question of whether it's better to have the store wider or the load split and narrow.
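
To make the hazard concrete, here is a minimal C++ sketch (hypothetical
code with made-up names, not from the patch or the benchmark) of the
shape under discussion: a narrow store followed by a wider load that
covers it.

    #include <cstdint>

    // Sketch only: a global 64-bit word, with one function storing a
    // single byte of it and another loading the whole word.
    uint64_t word;

    void narrow_store() {
        // Store a single byte of the 64-bit word.
        reinterpret_cast<unsigned char *>(&word)[1] = 3;
    }

    uint64_t wide_load() {
        // Load all eight bytes. If the one-byte store is still in the
        // store queue, this load overlaps it but is wider, so it
        // typically cannot be forwarded and stalls.
        return word;
    }

The reverse shape (a wide store followed by a narrow load contained
within it) generally does forward on common x86 cores, subject to
alignment constraints, which is what the two options discussed below
trade on.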

Thanks again,
Hal


David



On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfin...@anl.gov> wrote:


    On 09/03/2017 10:38 PM, Xinliang David Li wrote:
    The store-forwarding stall cost is usually much higher than that of
    a load hitting the L1 cache. For instance, on Haswell, the latter is
    ~4 cycles, while a store-forwarding stall costs about 10 cycles more
    than a successful store forward, i.e., roughly 15 cycles in total.
    In some scenarios, store-forwarding stalls can be as high as 50
    cycles. See Agner Fog's documentation.

    I understand. As I understand it, there are two potential ways to
    fix this problem:

     1. You can make the store wider (to match the size of the wide
    load, thus permitting forwarding).
     2. You can make the load smaller (to match the size of the small
    store, thus permitting forwarding).

    At least in this benchmark, which is a better solution?
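
    To make the two options concrete for the benchmark quoted below
    (1.cc), here is a rough C++ sketch of what each amounts to, assuming
    the usual little-endian x86 bitfield layout in which f3 occupies
    byte 1 of the 64-bit word (hypothetical code with made-up names, not
    compiler output):

        #include <cstdint>

        // Sketch only: a_word stands in for the storage of 'A a'.
        uint64_t a_word;
        uint64_t b;

        // Option 1: widen the store in foo() to the full word via a
        // read-modify-write, so it matches the wide i64 load in goo().
        void foo_widened_store() {
            uint64_t w = a_word;
            w = (w & ~UINT64_C(0xff00)) | (UINT64_C(3) << 8);  // f3 = 3
            a_word = w;
        }

        // Option 2: narrow the load in goo() to one byte, so it matches
        // the shrunk one-byte store that trunk already emits in foo().
        void goo_narrow_load() {
            b = reinterpret_cast<unsigned char *>(&a_word)[1];  // read f3
        }

    Option 1 adds a load (and a dependence) to foo(); Option 2 keeps
    both accesses narrow but requires the load shrinking to actually
    happen.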

    Thanks again,
    Hal



    In other words, the optimizer needs to be taught to avoid
    defeating the HW pipeline feature as much as possible.

    David

    On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfin...@anl.gov> wrote:


        On 09/03/2017 03:44 PM, Wei Mi wrote:

            On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfin...@anl.gov> wrote:

                On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

                    On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davi...@google.com> wrote:


                        On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <revi...@reviews.llvm.org> wrote:

                            chandlerc added a comment.

                            I'm really not a fan of the degree of
                            complexity and subtlety that this
                            introduces into the frontend, all to
                            allow particular backend
                            optimizations.

                            I feel like this is Clang working around
                            a fundamental deficiency in
                            LLVM
                            and we should instead find a way to fix
                            this in LLVM itself.

                            As has been pointed out before, user code
                            can synthesize large integers
                            that small bit sequences are extracted
                            from, and Clang and LLVM should
                            handle those just as well as actual
                            bitfields.

                            Can we see how far we can push the LLVM
                            side before we add complexity to
                            Clang here? I understand that there
                            remain challenges to LLVM's stuff,
                            but I
                            don't think those challenges make *all*
                            of the LLVM improvements off the
                            table, I don't think we've exhausted all
                            ways of improving the LLVM
                            changes
                            being proposed, and I think we should
                            still land all of those and
                            re-evaluate how important these issues
                            are when all of that is in place.


                        The main challenge of doing this in LLVM is
                        that inter-procedural analysis (and possibly
                        cross-module analysis) is needed (for store
                        forwarding issues).

                        Wei, perhaps you can provide a concrete test
                        case to illustrate the issue so that reviewers
                        have a good understanding.

                        David

                    Here is a runnable testcase:
                    -------------------- 1.cc ------------------------
                    class A {
                    public:
                        unsigned long f1:2;
                        unsigned long f2:6;
                        unsigned long f3:8;
                        unsigned long f4:4;
                    };
                    A a;
                    unsigned long b;
                    unsigned long N = 1000000000;

                    __attribute__((noinline))
                    void foo() {
                        a.f3 = 3;
                    }

                    __attribute__((noinline))
                    void goo() {
                        b = a.f3;
                    }

                    int main() {
                        unsigned long i;
                        for (i = 0; i < N; i++) {
                          foo();
                          goo();
                        }
                    }
                    ------------------------------------------------------------
                    With trunk, this now takes about twice the running
                    time of trunk + this patch. That is because trunk
                    shrinks the store of a.f3 in foo (done by the
                    DAGCombiner) but does not shrink the load of a.f3
                    in goo, so store forwarding is blocked.


                I can confirm that this is true on Haswell and also
                on a POWER8. Interestingly, on a POWER7, the
                performance is the same either way (on the good
                side). I ran the test case both as presented and
                with f3 replaced by a non-bitfield unsigned char
                member. Thinking that the POWER7 result might be
                because it's big-endian, I flipped the order of the
                fields, and found that the version where f3 is not a
                bitfield is faster than otherwise, but only by
                12.5%.

                Why, in this case, don't we shrink the load? It seems
                like we should (and it
                seems like a straightforward case).

                Thanks again,
                Hal

            Hal, thanks for trying the test.

            Yes, it is straightforward to shrink the load in that
            test. I can change the testcase a little to show why it
            is sometimes difficult to shrink the load:

            class A {
            public:
               unsigned long f1:16;
               unsigned long f2:16;
               unsigned long f3:16;
               unsigned long f4:8;
            };
            A a;
            bool b;
            unsigned long N = 1000000000;

            __attribute__((noinline))
            void foo() {
               a.f4 = 3;
            }

            __attribute__((noinline))
            void goo() {
               b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
            }

            int main() {
               unsigned long i;
               for (i = 0; i < N; i++) {
                 foo();
                 goo();
               }
            }

            For the load of a.f4 in goo, it is difficult to motivate
            its shrinking after instcombine, because the comparison
            with a.f3 and the comparison with a.f4 are merged:

            define void @_Z3goov() local_unnamed_addr #0 {
               %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
               %2 = and i64 %1, 72057589742960640    ; 0x00ffffff00000000 (f3 and f4)
               %3 = icmp eq i64 %2, 281470681743360  ; 0x0000ffff00000000 (f3 == -1, f4 == 0)
               %4 = zext i1 %3 to i8
               store i8 %4, i8* @b, align 1, !tbaa !2
               ret void
            }


        Exactly. But this behavior is desirable, locally. There's no
        good answer here: We either generate extra load traffic here
        (because we need to load the fields separately), or we widen
        the store (which generates extra load traffic there). Do you
        know, in terms of performance, which is better in this case
        (i.e., is it better to widen the store or split the load)?
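
        For reference, a rough sketch of the two alternatives for this
        second testcase, written as C++ under the usual little-endian
        x86 layout (f3 at byte offset 4, f4 at byte offset 6 of the
        64-bit word); hypothetical code with made-up names, not
        compiler output:

            #include <cstdint>
            #include <cstring>

            // Sketch only: a_word stands in for the storage of 'A a'.
            uint64_t a_word;
            bool b;

            // Alternative 1: widen the store in foo() to the full word
            // (read-modify-write), so the existing i64 load in goo()
            // can be forwarded.
            void foo_widened() {
                uint64_t w = a_word;
                w = (w & ~(UINT64_C(0xff) << 48)) | (UINT64_C(3) << 48);
                a_word = w;
            }

            // Alternative 2: split the merged i64 load in goo() into
            // narrow pieces, so the piece that overlaps the one-byte
            // store of f4 is itself one byte and can be forwarded.
            void goo_split() {
                unsigned char f4 = reinterpret_cast<unsigned char *>(&a_word)[6];
                uint16_t f3;
                std::memcpy(&f3, reinterpret_cast<char *>(&a_word) + 4, sizeof f3);
                b = (f4 == 0) && (f3 == 0xffff);
            }

        Alternative 2 also undoes the single-compare form that
        instcombine produced above, trading one wide load and one
        compare for two narrow loads and two compares.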

         -Hal



            Thanks,
            Wei.

                    The testcase shows the potential problem of
                    store shrinking. Before we decide to do store
                    shrinking, we need to know that all the related
                    loads will be shrunk, and that requires IPA
                    analysis. Otherwise, when load shrinking is
                    blocked for some difficult case (like the
                    instcombine case described in
                    https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
                    a performance regression will happen.

                    Wei.



                            Repository:
                                rL LLVM

                            https://reviews.llvm.org/D36562







--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
