Hi All,

Thank you, Mohammad, for elaborating on the issues!

I wrote most of the multi-gem5 patch, so let me add some clarifications 
and respond to your concerns. My comments are inline below.

Thanks,
- Gabor

On 27/06/2015 10:20, "gem5-dev on behalf of Mohammad Alian"
<gem5-dev-boun...@gem5.org<mailto:gem5-dev-boun...@gem5.org> on behalf of 
al...@wisc.edu<mailto:al...@wisc.edu>> wrote:

Hi All,

Curtis, thank you for listing some of the differences. I was waiting for the
completed multi-gem5 patch before sending my review. Please see my inline
responses below. I’ve addressed the concerns that you’ve raised. Also, I’ve
added a bit more to the comparison.

*  Synchronization.

pd-gem5 implements this in Python (not a problem in itself; aesthetically
this is nice, but...).  The issue is that pd-gem5's data packets and
barrier messages travel over different sockets.  Since pd-gem5 could see
data packets passing synchronization barriers, it could create an
inconsistent checkpoint.

multi-gem5's synchronization is implemented in C++ using sync events, but
more importantly, the messages queue up in the same stream and so cannot
have the issue just described.  (Event ordering is often crucial in
snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
solution in this respect.

Each packet in pd-gem5 has a timestamp. So even if data packets pass
synchronization barriers (in other words, data packets arrive early at the
destination node), the destination node processes packets based on their
timestamps. Actually, allowing data packets to pass sync barriers is a nice
feature that can reduce the likelihood of late packet reception. The ordering
of data messages that flow between pd-gem5 nodes is also preserved in the
pd-gem5 implementation.

This seems to be a misunderstanding; maybe the earlier wording was not precise. 
The problem isn’t a data packet “passing” a sync barrier but the other 
way around: a sync barrier that can pass a data packet (e.g. while the data 
packet is waiting in the host operating system socket layer).  If that happens, 
the packet will arrive later than it was supposed to and it may miss its 
computed receive tick.

For instance, let’s assume that the quantum coincides with the simulated Ether 
link delay. (This is the optimal choice of quantum to minimize the number of 
sync barriers.)  If a data packet is sent right at the beginning of a quantum 
then this packet must arrive at the destination gem5 process within the same 
quantum in order not to miss its receive tick at the very beginning of the next 
quantum. If the sync barrier can pass the data packet then the data packet may 
arrive only during the next quantum (or in extreme conditions even later than 
that), so by the time it arrives the receiver gem5 may already have passed the 
receive tick.

Time-stamping does not help with this issue. Also, if a data packet is waiting in 
the host operating system socket layer when the simulation thread exits to 
Python to complete the next sync barrier, then the packet will not go into the 
checkpoint that may follow that sync barrier.
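
To make the timing argument above concrete, here is a minimal Python sketch
(it is not gem5 code). It assumes, as in the example, that the quantum equals
the link delay; the function names and the tick values are made up purely for
illustration.

```python
QUANTUM = 1000      # hypothetical quantum length, in ticks
LINK_DELAY = 1000   # assumed equal to the quantum, as in the example


def receive_tick(send_tick, link_delay=LINK_DELAY):
    """Tick at which the receiver must process the packet."""
    return send_tick + link_delay


def next_quantum_start(tick, quantum=QUANTUM):
    """First tick of the quantum that follows the one containing `tick`."""
    return (tick // quantum + 1) * quantum


def misses_receive_tick(send_tick, receiver_tick_on_arrival):
    """True if the receiver has already simulated past the receive tick."""
    return receiver_tick_on_arrival > receive_tick(send_tick)


send_tick = 1000  # sent right at the beginning of a quantum

# Data packet delivered before the barrier completes: the receiver cannot
# have advanced beyond the start of the next quantum.
in_order = misses_receive_tick(send_tick, next_quantum_start(send_tick))

# Barrier overtakes the packet: the receiver is already inside the next
# quantum when the packet finally arrives.
overtaken = misses_receive_tick(send_tick, next_quantum_start(send_tick) + 1)

print(in_order, overtaken)
```

In this toy model the packet is only safe as long as the barrier cannot
complete ahead of it, which is the property a single ordered stream gives.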


What you mentioned as an advantage for multi-gem5 is actually a key
disadvantage: buffering sync messages behind data packets can add to the
synchronization overhead and slow down simulation significantly.

The purpose of sync messages is to make sure that the data packets arrive in 
time (in terms of simulated time) at the destination so they can be scheduled 
to be received at the proper computed tick.  Sync messages also make sure 
that no data packets are in flight when a sync barrier completes before we take 
a checkpoint.  They definitely add overhead to the simulation but they are 
necessary for the correctness of the simulation.

The receive thread in multi-gem5 reads packets from the socket in parallel 
with the simulation thread so packets normally will not be “queueing up” before 
a sync barrier message.  There is definitely room for improvement in the 
current implementation for reducing the synchronization overhead but that is 
likely true for pd-gem5, too. The important thing here is that the solution 
must provide correctness (robustness) first.

Also, multi-gem5 sends huge messages (multiHeaderPkt) through the network to
perform each synchronization point, which increases synchronization
overhead further. In pd-gem5, we chose to send just one character as the sync
message through a separate socket to reduce synchronization overhead.

The TCP/IP message size is unlikely to be the bottleneck here. Multi-gem5 will 
send ~50 bytes more in a sync barrier message than pd-gem5 but that bigger sync 
message still fits into a single Ethernet frame on the wire. The end-to-end 
latency overhead caused by 50 bytes of extra payload for a small single-frame 
TCP/IP message is likely to fall into the ‘noise’ category if one tries 
to measure it in a real cluster.
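
As a rough back-of-the-envelope check of the single-frame claim, the sketch
below adds up the sizes in Python. It assumes a standard 1500-byte Ethernet
MTU and minimum-size IPv4 and TCP headers without options; the 1-byte and
~51-byte payloads are taken from the figures quoted in this thread, not from
measurements.

```python
ETHERNET_MTU = 1500   # bytes available to IP in a standard Ethernet frame
IPV4_HEADER = 20      # minimum IPv4 header, no options (assumption)
TCP_HEADER = 20       # minimum TCP header, no options (assumption)

PD_GEM5_SYNC = 1                      # one character, as stated above
MULTI_GEM5_SYNC = PD_GEM5_SYNC + 50   # "~50 bytes more" (approximate)


def fits_in_one_frame(payload_bytes):
    """True if the payload plus TCP/IP headers fits within one MTU."""
    return IPV4_HEADER + TCP_HEADER + payload_bytes <= ETHERNET_MTU


for name, size in (("pd-gem5", PD_GEM5_SYNC), ("multi-gem5", MULTI_GEM5_SYNC)):
    print(name, size, fits_in_one_frame(size))
```

Both messages are far below the MTU, so under these assumptions each sync
message costs one frame on the wire either way.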


*  Packet handling.

pd-gem5 uses EtherTap for data packets but changed the polling mechanism
to go through the main event queue.  Since this rate is actually linked
with simulator progress, it cannot guarantee that the packets are serviced
at regular intervals of real time.  This can lead to packets queueing up
which would contribute to the synchronization issues mentioned above.

multi-gem5 uses plain sockets with separate receive threads and so does not
have this issue.

I think you are again pointing to your first concern, which I’ve explained
above. Packets that have queued up in the EtherTap socket will be processed
and delivered to the simulation environment at the beginning of the next
simulation quantum.

As I pointed out above, a packet queued up in the EtherTap socket may miss the 
quantum in which it should be received and/or the checkpoint in which it should 
be saved.


Please notice that multi-gem5 introduces a new simObject to interface the
simulation environment to the real world, which is redundant. This
functionality is already provided by EtherTap.

Except that the EtherTap solution does not provide a correct (robust) solution 
for the synchronization problem.


* Checkpoint accuracy.

A user would like to have a checkpoint at precisely the time the
'm5 checkpoint' operation is executed so as to not miss any of the
area of interest in his application.

pd-gem5 requires that simulation finish the current quantum
before checkpointing, so it cannot provide this.
(Shortening the quantum can help, but usually the snapshot is being taken
while 'fast-forwarding', i.e. simulating as fast as possible, which would
motivate a longer quantum.)

multi-gem5 can enter the drain cycle immediately upon receiving a
checkpoint request.  We find this accuracy highly desirable.

It’s true that if you have a large quantum size then there will be some
discrepancy between the m5_ckpt instruction tick and the actual dump tick.
Based on the multi-gem5 code, my understanding is that you send an async
checkpoint message as soon as one of the gem5 processes encounters an m5_ckpt
instruction. But I’m not sure how you fix the aforementioned issue, because
you have to sync all gem5 processes before you start dumping the checkpoint,
which necessitates a global synchronization beforehand.

In multi-gem5, the gem5 process that encounters the m5_ckpt instruction sends 
out an async checkpoint notification to the peer gem5 processes and then 
starts draining immediately (at the same tick).  So the checkpoint will be 
taken at the exact tick from the initiator process's point of view. The global 
synchronisation with the peer processes takes place while the initiator process 
is still waiting at the same tick (i.e. the simulation thread is suspended). 
However, the receiver thread continues reading from the socket while waiting 
for the global sync to complete, to make sure that in-flight data packets from 
peer gem5 processes are stored properly and saved into the checkpoint.
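
The sequence described above can be sketched as a toy, single-threaded Python
model. It is not the multi-gem5 implementation: the `SYNC_DONE` marker, the
`take_checkpoint` function and the message queue are hypothetical stand-ins
for the real receive thread and sync protocol.

```python
from collections import deque

SYNC_DONE = object()  # hypothetical marker: global sync has completed


def take_checkpoint(current_tick, incoming):
    """Freeze simulated time, then drain the stream until the sync completes.

    `incoming` models the ordered stream read by the receive thread. Every
    data packet that shows up before SYNC_DONE was still in flight when the
    checkpoint was requested, so it is stored in the checkpoint.
    """
    ckpt_tick = current_tick  # the simulation thread stays at this tick
    in_flight = []
    while incoming:
        msg = incoming.popleft()
        if msg is SYNC_DONE:
            break
        in_flight.append(msg)
    return {"tick": ckpt_tick, "in_flight": in_flight}


stream = deque(["pkt-a", "pkt-b", SYNC_DONE])
checkpoint = take_checkpoint(current_tick=500, incoming=stream)
print(checkpoint)
```

The point of the sketch is only that the checkpoint tick is fixed when the
request is seen, while packet collection continues until the sync completes.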


By the way, we have a fix for this issue by introducing a new m5 pseudo
instruction.

I fail to see how a new pseudo instruction can solve the problem of completing 
the full quantum in pd-gem5 before a checkpoint can be taken. Could you please 
elaborate on that?


* Implementation of network topology.

pd-gem5 uses a separate gem5 process to act as a switch whereas multi-gem5
uses a standalone packet relay process.

We haven't measured the overhead of pd-gem5's simulated switch yet, but
we're confident that our approach is at least as fast and more scalable.

pd-gem5 has the flexibility to simulate a switch box alongside one
of the other gem5 processes. However, that might make that gem5 process the
simulation bottleneck. One of the advantages of pd-gem5 over multi-gem5 is
that we use gem5 to simulate a switch box, which allows us to model any
network topology by instantiating several Switch simObjects and
interconnecting them with EtherLink in an arbitrary fashion. A standalone TCP
server can only provide switch functionality (forwarding packets to
destinations) and model a star network topology. Furthermore, it cannot
model various network timings such as queueing delay, congestion, and
routing latency. It also has some accuracy issues that I will point out
next.

I agree with the complex topology argument; I already mentioned that before as 
an advantage for pd-gem5 from the point of view of future extensions. However, 
I do not agree that multi-gem5 cannot model queueing delays and congestion. 
For a simple crossbar switch, it can model queueing delays and congestion, but 
the receive queues are distributed among the gem5 processes.


* Broken network timing:

Forwarding packets between gem5 processes using a standalone TCP server can
cause reordering between packets that have different sources but the same
destination. This causes inaccurate network timing and, worst of all,
non-deterministic simulation. pd-gem5 resolves this by reordering packets at
the Switch process and then sending them to their destinations (this is
possible because the switch is synchronized with the rest of the nodes).

In multi-gem5, there is always a HeaderPkt that contains some meta information 
for each data packet. The meta information includes the send tick and the sender 
rank (i.e. a unique ID of the sender gem5 process). We use this information 
to establish a well-defined ordering of packets even if packets arrive at 
the same receiver from different senders.  This packet ordering scheme is still 
being tested so the corresponding patch is not on the RB yet.
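
One way such an ordering could work is sketched below in Python. This is only
an illustration under assumptions: the `PacketMeta` class, its field names and
the choice to break ties on sender rank are hypothetical, and the scheme in
the actual patch may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketMeta:
    send_tick: int     # simulated tick at which the packet was sent
    sender_rank: int   # unique ID of the sending gem5 process
    payload: bytes = b""


def delivery_order(packets):
    """Order packets by send tick, breaking ties on the sender rank."""
    return sorted(packets, key=lambda p: (p.send_tick, p.sender_rank))


a = PacketMeta(send_tick=100, sender_rank=1, payload=b"a")
b = PacketMeta(send_tick=100, sender_rank=0, payload=b"b")
c = PacketMeta(send_tick=90, sender_rank=2, payload=b"c")

# Two different host-level arrival orders yield the same delivery order.
order_1 = delivery_order([a, b, c])
order_2 = delivery_order([c, a, b])
print(order_1 == order_2)
```

Because the sort key depends only on simulated quantities, the result does not
depend on the order in which the host network happened to deliver the packets.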


* Amount of changes

pd-gem5 introduces different modes in EtherLink just to provide accurate
timing for each component in the network subsystem (NIC, link, switch) as
well as the capability of modeling different network topologies (mesh, ring,
fat tree, etc). To enable simple functionality, like what multi-gem5
provides, the amount of changes in gem5 can be limited to time-stamping
packets and providing synchronization through Python scripts. However,
multi-gem5 re-implements functionality that is already in gem5.

This argument holds only if both implementations are correct (robust). It still 
seems to me that pd-gem5 does not provide correctness for the 
synchronization/checkpointing parts.


* Integrating with gem5 mainstream:

The pd-gem5 launch script is written in Python, which is suited for integration
with the gem5 Python scripts. However, multi-gem5 uses a bash script. Also, all
source files in pd-gem5 are already part of gem5 mainstream. However,
multi-gem5 has tcp_server.cc/hh, which is a standalone process and cannot be
part of gem5.

The multi-gem5 launch script is simple enough to rely only on the shell.  It 
could easily be rewritten in Python if that added any value.  The 
tcp_server component is only a utility (like the ‘m5’ utility that is also 
part of gem5).

Cheers,
- Gabor



On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham 
<curtis.dun...@arm.com<mailto:curtis.dun...@arm.com>>
wrote:

Hello everyone,
We have taken a look at how pd-gem5 compares with multi-gem5.  While intending
to deliver the same functionality, there are some crucial differences:

*  Synchronization.

    pd-gem5 implements this in Python (not a problem in itself; aesthetically
    this is nice, but...).  The issue is that pd-gem5's data packets and
    barrier messages travel over different sockets.  Since pd-gem5 could see
    data packets passing synchronization barriers, it could create an
    inconsistent checkpoint.

    multi-gem5's synchronization is implemented in C++ using sync events, but
    more importantly, the messages queue up in the same stream and so cannot
    have the issue just described.  (Event ordering is often crucial in
    snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
    solution in this respect.

*  Packet handling.

    pd-gem5 uses EtherTap for data packets but changed the polling mechanism
    to go through the main event queue.  Since this rate is actually linked
    with simulator progress, it cannot guarantee that the packets are serviced
    at regular intervals of real time.  This can lead to packets queueing up
    which would contribute to the synchronization issues mentioned above.

    multi-gem5 uses plain sockets with separate receive threads and so does
    not have this issue.

* Checkpoint accuracy.

   A user would like to have a checkpoint at precisely the time the
   'm5 checkpoint' operation is executed so as to not miss any of the
   area of interest in his application.

   pd-gem5 requires that simulation finish the current quantum
   before checkpointing, so it cannot provide this.

   (Shortening the quantum can help, but usually the snapshot is being taken
   while 'fast-forwarding', i.e. simulating as fast as possible, which would
   motivate a longer quantum.)

   multi-gem5 can enter the drain cycle immediately upon receiving a
   checkpoint request.  We find this accuracy highly desirable.

* Implementation of network topology.

   pd-gem5 uses a separate gem5 process to act as a switch whereas multi-gem5
   uses a standalone packet relay process.

   We haven't measured the overhead of pd-gem5's simulated switch yet, but
   we're confident that our approach is at least as fast and more scalable.


Thanks,
Curtis
________________________________________
From: gem5-dev [gem5-dev-boun...@gem5.org<mailto:gem5-dev-boun...@gem5.org>] On 
Behalf Of Mohammad Alian [
al...@wisc.edu<mailto:al...@wisc.edu>]
Sent: Friday, June 26, 2015 7:37 PM
To: gem5 Developer List
Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
system
on multiple physical hosts

Hi Anthony,

I think that would be a good option, then I can add pd-gem5
functionality
on top of that. Right now I've simplified your implementation. Also, I
think I had found some bugs in your patch that I cannot remember now. If
you decided to ship EtherSwitch patch, let me know to give you a review
on
that.

Thanks,
Mohammad

On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
anthony.gutier...@amd.com<mailto:anthony.gutier...@amd.com>> wrote:

> Would it make sense for me to ship the EtherSwitch patch first, since
it
> has utility on its own, and then we can decide which of the
"multi-gem5"
> approaches is best, or if it's some combination of both?
>
> The only reason I never shipped it was because Steve raised an issue
that
> I didn't have a good alternative for, and didn't have the time to look
into
> one at that time.
> ________________________________________
> From: gem5-dev [gem5-dev-boun...@gem5.org<mailto:gem5-dev-boun...@gem5.org>] 
> on behalf of Mohammad
Alian [
> al...@wisc.edu<mailto:al...@wisc.edu>]
> Sent: Wednesday, June 24, 2015 12:43 PM
> To: gem5 Developer List
> Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
system
> on multiple physical hosts
>
> Hi Andreas,
>
> Thanks for the comment.
> I think the checkpointing support in both works is the same. Here is
how
> checkpointing support is implemented in pd-gem5:
>
> Whenever one of gem5 processes encounter an m5-checkpoint pseudo
> instruction, it will send a “recv-ckpt” signal to the
> “barrier” process. Then the “barrier” process sends a “take-ckpt”
signal
to
> all the simulated nodes
> (including the node that encountered m5-checkpoint) at the end of the
> current simulation quantum. On the reception of
> “take-ckpt” signal, gem5 processes start dumping check-points. This
makes
> each simulated node dump a checkpoint
> at the same simulated time point while ensuring there is no in-flight
> packets.
>
> I believe this is the same as multi-gem5 patch approach for checkpoint
> support (based on the commit message of
http://reviews.gem5.org/r/2865/
).
> Also, we have tested our mechanism with several benchmarks and it
works.
As
> Steve suggested, I'll look into Curtis's patch and try to review it as
> well.
> But as Nilay also mentioned earlier, there are some codes missing in
> Curtis's patch. I prefer to first run multi-gem5 before starting to
review
> it.
>
> Thank you,
> Mohammad
>
> On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
andreas.hans...@arm.com<mailto:andreas.hans...@arm.com>>
> wrote:
>
> > Hi Steve,
> >
> > Apologies for the confusion. We are on the same page. My point is
that
we
> > cannot simply take a little bit of patch A and a little bit of
patch B.
> > This change involves a lot of code, and we need to approach this in
a
> > structured fashion. My proposal is to do it bottom up, and start by
> > getting the basic support in place. Since
> http://reviews.gem5.org/r/2826/
> > has already been on the review board for a few months, I am merely
> > suggesting that the it would be a good start to relate the newly
posted
> > patches to what is already there.
> >
> > Andreas
> >
> >
> >
> > On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> > <gem5-dev-boun...@gem5.org<mailto:gem5-dev-boun...@gem5.org> on behalf of 
> > ste...@gmail.com<mailto:ste...@gmail.com>> wrote:
> >
> > >Hi Andreas,
> > >
> > >I'm a little confused by your email---you say you're fundamentally
> opposed
> > >to looking at both patches and picking the best features, then you
point
> > >out that the patches Curtis posted have the feature of better
> > >checkpointing
> > >support so we should pick that :).
> > >
> > >Obviously we can't just pick patch A from Mohammad's set and patch
B
> from
> > >Curtis's set and expect them to work together, but I think that
having
> > >both
> > >sets of patches available and comparing and contrasting the two
> > >implementations should enable us to get to a single implementation
> that's
> > >the best of both. Someone will have to make the effort of
integrating
> the
> > >better ideas from one set into the other set to create a new
unified
set
> > >of
> > >patches; (or maybe we commit one set and then integrate the best of
the
> > >other set as patches on top of that), but the first step is to
identify
> > >what "the best of both" is.  Having Mohammad look at Curtis's
patches,
> and
> > >Curtis (or someone else from ARM) closely examine Mohammad's
patches
> would
> > >be a great start.  I intend to review them both, though
unfortunately
my
> > >time has been scarce lately---I'm hoping to squeeze that in later
this
> > >week.
> > >
> > >Once we've had a few people look at both, we can discuss the pros
and
> cons
> > >of each, then discuss the strategy for getting the best features
in.
So
> > >far I've heard that Mohammad's patches have a better network model
but
> the
> > >ARM patches have better checkpointing support; that seems like a
good
> > >start.
> > >
> > >Steve
> > >
> > >On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> andreas.hans...@arm.com<mailto:andreas.hans...@arm.com>
> > >
> > >wrote:
> > >
> > >> Hi all,
> > >>
> > >> Great work. However, I fundamentally do not believe in the
approach
of
> > >> ‘letting reviewers pick the best features’. There is no way we
would
> > >>ever
> > >> get something working out if it. We need to get _one_ working
solution
> > >> here, and figure out how to best get there. I would propose to
do it
> > >> bottom up, starting with the basic multi-simulator instance
support,
> > >> checkpointing support, and then move on to the network between
the
> > >> simulator instances.
> > >>
> > >> Thus, I propose we go with the low-level plumbing and checkpoint
> support
> > >> from what Curtis has posted. I believe proper checkpointing
support
to
> > >>be
> > >> the most challenging, and from what I can tell this is far more
> limited
> > >>in
> > >> what you just posted Mohammad. Could you perhaps review Curtis
patches
> > >> based on your insights, and we can try and get these patches in
shape
> > >>and
> > >> committed asap.
> > >>
> > >> Once we have the baseline functionality in place, then we can
start
> > >> looking at the more elaborate network models.
> > >>
> > >> Does this sound reasonable?
> > >>
> > >> Thanks,
> > >>
> > >> Andreas
> > >>
> > >> On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> > >> <gem5-dev-boun...@gem5.org<mailto:gem5-dev-boun...@gem5.org> on behalf 
> > >> of al...@wisc.edu<mailto:al...@wisc.edu>> wrote:
> > >>
> > >> >Hello All,
> > >> >
> > >> >I have submitted a chain of patches which enables gem5 to
simulate
a
> > >> >cluster on multiple physical hosts:
> > >> >
> > >> >http://reviews.gem5.org/r/2909/
> > >> >http://reviews.gem5.org/r/2910/
> > >> >http://reviews.gem5.org/r/2912/
> > >> >http://reviews.gem5.org/r/2913/
> > >> >http://reviews.gem5.org/r/2914/
> > >> >
> > >> >and a patch that contains run scripts for a simple experiment:
> > >> >http://reviews.gem5.org/r/2915/
> > >> >
> > >> >We have run several benchmarks using this infrastructure,
including
> NAS
> > >> >parallel benchmarks (MPI) and DCBench-hadoop
> > >> >(http://prof.ict.ac.cn/DCBench/),
> > >> >and would be happy to share scripts/diskimages.
> > >> >
> > >> >We call this *pd-gem5*. *pd-gem5 *functionality is more or less
the
> > >>same
> > >> >as
> > >> >Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
*network
> > >>model
> > >> >is
> > >> >more thorough; it also enables modeling different network
topologies.
> > >> >Having both set of changes together let reviewers to pick best
> features
> > >> >from both works.
> > >> >
> > >> >Thank you,
> > >> >Mohammad Alian
> > >> >_______________________________________________
> > >> >gem5-dev mailing list
> > >> >gem5-dev@gem5.org<mailto:gem5-dev@gem5.org>
> > >> >http://m5sim.org/mailman/listinfo/gem5-dev
> > >>
> > >>
> > >> -- IMPORTANT NOTICE: The contents of this email and any
attachments
> are
> > >> confidential and may also be privileged. If you are not the
intended
> > >> recipient, please notify the sender immediately and do not
disclose
> the
> > >> contents to any other person, use it for any purpose, or store or
copy
> > >>the
> > >> information in any medium.  Thank you.
> > >>
> > >> ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1
9NJ,
> > >> Registered in England & Wales, Company No:  2557590
> > >> ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge
CB1
> > >>9NJ,
> > >> Registered in England & Wales, Company No:  2548782