Hi Eliot,

Thanks for the response. The unrolled loop has the same dependency
through "j", yet it can issue multiple loads simultaneously, so the
limitation is probably not due to the dependency on "j" across
iterations. The non-unrolled loop does have a control dependency, which
goes away when the loop is unrolled. But even without any speculation,
gem5 could have scheduled the loads as follows:

first schedule load A[ k ]
then, compare i with N
then schedule load A[ k + 1 ] if i < N

What actually happens is that the load of A[ k + 1 ] is scheduled only
after the load of A[ k ] has completed. Is waiting for that completion
necessary? It does not seem so, and the memory system ends up
underutilised.

Thanks and regards,
Aritra


On Fri, Oct 7, 2022 at 10:48 PM Eliot Moss <m...@cs.umass.edu> wrote:

> On 10/7/2022 1:13 PM, Eliot Moss wrote:
> > On 10/7/2022 1:03 PM, Aritra Bagchi wrote:
> >> Hi all,
> >>
> >> Any suggestions on this are most helpful.
> >>
> >> Thanks and regards,
> >> Aritra
> >
> > My guess is that it is because the non-unrolled loop
> > has a test of i against 1000 before each access to A[i].
> > That test guards the load, so must be completed before
> > the load can proceed.  It could also be because of the
> > way j is used - the next update cannot proceed until
> > the last one finishes.  It might be helpful to look
> > at the actual instructions involved, but the control
> > dependency could be an issue.
> >
> > The unrolled loop avoids both of these possible
> > dependencies.
> >
> > I further observe that if we are talking about an
> > Intel processor, those processors handle loads in the
> > order the program presents them.  Not sure if that
> > has any impact here.  Also unsure whether CPU
> > speculative execution plays a role (which would actually
> > improve matters).
> >
> > Best - Eliot Moss
> >
> >> On Thu, Oct 6, 2022 at 6:01 PM Aritra Bagchi <bagchi95ari...@gmail.com> wrote:
> >>
> >>     Hi all,
> >>
> >>     for (i = 0; i < 1000; i++) {
> >>         j = j + A[ i ];
> >>     }
> >>
> >>     Suppose such a loop program is executed on gem5 (single-core
> >>     execution, with the O3 CPU model). In that case, the memory
> >>     hierarchy sees only one access at a time: the access to A[ k + 1 ]
> >>     is sent to the memory hierarchy only after A[ k ] has completed.
> >>     If the loop is unrolled (on i), however, multiple memory accesses
> >>     are seen simultaneously. Why is that so? The memory loads could be
> >>     serviced independently (even without unrolling the loop), so why
> >>     is gem5 taking such a conservative approach?
> >>
> >>     Any form of help/suggestion is highly appreciated.
> >>
> >>     Thanks and regards,
> >>     Aritra Bagchi
> >>     Research Scholar,
> >>     Department of Computer Science and Engineering,
> >>     Indian Institute of Technology Delhi
> >>
> >>
> >>
> >> _______________________________________________
> >> gem5-users mailing list -- gem5-users@gem5.org
> >> To unsubscribe send an email to gem5-users-le...@gem5.org
