Sorry it took so long for me to reply, Ali Saidi, but I've been busy and
I wanted to think things through. I think I had the wrong idea of how m5
works (as I said, I'm a newbie). I previously thought m5 worked in a more
packet-centric way: one SimObject would produce request packets, and
connected SimObjects would check every cycle whether they had received
any requests, then forward or handle them and send replies. I thought
events only drove the internal simulation of SimObjects, including the
constant polling for messages. This is somewhat more intuitive, since
physical systems behave more like that, but it is also inefficient. Now
I see that m5 follows the DES (discrete event simulation) execution
paradigm. m5 could really use a documentation page describing its DES
system!
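To check my understanding, here is a minimal sketch of the DES paradigm in Python (illustration only; the names are mine and this is not m5's actual EventQueue API):

```python
import heapq

class EventQueue:
    """Minimal discrete event simulation loop: events are processed in
    timestamp order; nothing is polled every cycle."""
    def __init__(self):
        self._heap = []
        self._seq = 0          # tie-breaker for events at the same tick
        self.curtime = 0

    def schedule(self, when, action):
        heapq.heappush(self._heap, (when, self._seq, action))
        self._seq += 1

    def run(self):
        while self._heap:
            when, _, action = heapq.heappop(self._heap)
            self.curtime = when
            action(self)       # an event may schedule further events

def send_request(q):
    log.append(("req", q.curtime))
    # the "reply" simply becomes a future event -- no per-cycle polling
    q.schedule(q.curtime + 5, send_response)

def send_response(q):
    log.append(("resp", q.curtime))

log = []
eq = EventQueue()
eq.schedule(10, send_request)
eq.run()
```

The point is that simulated time jumps straight from tick 10 to tick 15; nothing runs in between.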

The challenge then is to move from DES to PDES (parallel DES).
Implementing a conservative PDES system is challenging in itself,
because efficient parallel processing of events on multiple queues
requires that these queues block as little as possible. A queue can only
process events up to a certain cycle when it is certain that events on
other queues will not schedule any more events on it that need to be
processed before that cycle; otherwise you get causality errors. Let's
call this time the critical latency. Let's also say that if processing
events on one queue can cause events to be scheduled on another, there
is a link from the first queue to the second. Since, in general, the
only way to be certain when and where events will schedule other events
is to execute them, a general solution for calculating the critical
latency does not exist. In other words, if you want to avoid total
synchronization, the events will have to tell your queues when it is
safe to proceed. This requires smart events that know which future
events will originate from them. Then you can dynamically change the
links between queues.
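To make this concrete, here is a toy safe-time calculation using the link/critical-latency terminology above (all names and numbers are hypothetical, not m5 code):

```python
# Conservative PDES sketch: a queue may only process events up to its
# "safe time" -- the minimum, over all incoming links, of the source
# queue's local clock plus that link's critical latency (the minimum
# delay before the source could schedule an event on this queue).

def safe_time(queue, links, clocks):
    """links: dict mapping (src, dst) -> critical latency in ticks.
    clocks: dict mapping queue name -> local simulated time."""
    incoming = [clocks[src] + lat
                for (src, dst), lat in links.items() if dst == queue]
    return min(incoming) if incoming else float("inf")

# Two CPU queues talking to a shared L2 queue, 4 ticks each way:
links = {("cpu0", "l2"): 4, ("cpu1", "l2"): 4,
         ("l2", "cpu0"): 4, ("l2", "cpu1"): 4}
clocks = {"cpu0": 100, "cpu1": 90, "l2": 88}
# l2 may safely advance to min(100 + 4, 90 + 4) = 94, no further.
```

Note that the lagging queue (cpu1 here) is what limits everyone else's progress; this is why the links need to be dynamic.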

Another challenge will be to enable the cross-scheduling of events while
at the same time protecting objects containing simulation state from
concurrent access. Furthermore, we should be able to deduce which events
access which state objects, so that we can enforce that a state object
is only affected by events running on one queue. This will also permit
events to be scheduled on other queues, allowing for dynamic workload
scheduling. In theory it is possible to share state between events on
different queues, but this is much more difficult.
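A sketch of what I mean by protected cross-scheduling, using a per-queue inbox (a hypothetical design of mine, not existing m5 code): each queue's event list and state objects are touched only by its own thread, while remote queues may only drop schedule requests into a thread-safe inbox that the owner drains at a synchronization point.

```python
import queue as _queue  # thread-safe FIFO from the standard library

class SimQueue:
    """Each queue owns its state objects; other queues never touch them
    directly.  Cross-scheduling goes through a thread-safe inbox, so
    local state is only ever mutated by the owning queue's thread."""
    def __init__(self, name):
        self.name = name
        self.events = []             # (tick, event) pairs, local only
        self.inbox = _queue.Queue()  # remote schedule requests

    def schedule_remote(self, tick, event):
        # called from another queue's thread; never touches self.events
        self.inbox.put((tick, event))

    def drain_inbox(self):
        # called from the owning thread at a synchronization point
        while not self.inbox.empty():
            self.events.append(self.inbox.get())
        self.events.sort(key=lambda e: e[0])

q = SimQueue("l2")
q.schedule_remote(20, "flush")   # would come from another thread
q.schedule_remote(10, "probe")
q.drain_inbox()                  # owner merges at the next sync point
```

The event names here are just string stand-ins for real event objects.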

I already have some ideas on how this can be implemented, but I'm going
to do some more research before I go into specifics. If this turns out
to be too much work I will not be able to do it. What is certain is that
existing code will have to be changed, possibly quite extensively. Maybe
I will be able to limit the changes to the subsystems I will be needing;
I don't know yet.

Optimistic PDES could also be implemented, but it would require pretty
much all of the work needed for conservative PDES plus detection and
rollback functionality, and its effectiveness is doubtful.

Are there events in m5 that can schedule new events in the same cycle?
If so, a PDES system can deadlock! This can be overcome generically, but
it will require more work.
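A toy illustration of that zero-latency deadlock (made-up numbers, not m5 code): with a critical latency of 0 in both directions, each queue's safe horizon equals the other's clock, so neither queue may process its own pending same-cycle event.

```python
# Two queues, zero-latency links both ways.  A queue may only process
# events strictly before its safe horizon (other clock + link latency).

def safe_horizon(other_clock, link_latency):
    return other_clock + link_latency

a_clock = b_clock = 50           # both at tick 50, each with an event at 50
a_safe = safe_horizon(b_clock, 0)  # 50: may only process ticks < 50
b_safe = safe_horizon(a_clock, 0)  # 50: same -- nobody can advance
deadlocked = not (a_clock < a_safe or b_clock < b_safe)
```

With any nonzero critical latency the horizons would exceed the clocks and at least one queue could make progress.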

Ali Saidi wrote:
> On Oct 4, 2009, at 8:11 AM, Stijn Souffriau wrote:
>
>   
>> Dear developers,
>>
>> I'm a senior computer science student at Ghent University starting  
>> work
>> on my master's thesis. I'm working on the simulation of many-core
>> systems together with researchers from the department of electronics  
>> and
>> information systems. Know that by many cores I mean hundreds of cores.
>> We think we will be able to simulate these cores by using two
>> techniques. One is non cycle accurate core simulation, based on  
>> interval
>> simulation, and the other is parallelization. The former is pretty  
>> much
>> taken care of, the latter is what I will be working on. You could  
>> say my
>> job is to scale the simulator up to many cores and address any  
>> accuracy
>> and synchronization problems along the way. Prior work has been
>> implemented as an interval simulating CPU in the m5-1.1 simulator. We
>> plan on implementing future work in version 2 of m5. At this early  
>> stage
>> I'm faced with several questions mostly concerning the modifiability  
>> of
>> m5 and this is why I'm writing you. I have absolutely no experience in
>> coding m5, yet,  which is why I would appreciate your thoughts on this
>> matter.
>>     
> This is great! It's something that we've slowly been working towards.  
> Nate is also probably going to chime in, but I'll tell you a bit about  
> what we've been thinking.
>
>
>   
>> I'm trying to figure out how generically I can parallelize m5. Ideally
>> all the components or groups of components (SimObjects) suited to  
>> run in
>> parallel will be driven by their own clock.  The main challenge  
>> might be
>> to adapt components so that they can communicate asynchronously. When
>> this is taken care of we can move on to implementing channels which  
>> will
>> facilitate the asynchronous communication between the components.  
>> These
>> channels will mainly protect against concurrency and serialize the
>> requests from asynchronously running SimObjects (e.g. CPU cores) to a
>> shared resource (e.g. L2 D-cache) whenever needed. When this is  
>> finished
>> it would just be a matter of assigning work to threads, scheduling  
>> them
>> and allocating memory (easy if shared, difficult if distributed) in  
>> the
>> core of m5.
>>     
> We've recently started down this route. The two changes that have been  
> implemented so far are a new configuration system in Python that  
> supports inheritance and then using that configuration system to set a  
> pointer to an event queue in which the object should schedule all its  
> events. Currently all the objects schedule their own events, however  
> not all SimObjects support the new coherence system yet.
>
>   

Ideally the simulator could optimize this itself by choosing what to run
in parallel, and even by dynamically distributing event processing
across event queues. However, at first, this might be the way to go.
What do you mean by the "new coherence system"?

> The three big pieces that are missing are how should threads  
> cross-schedule events on other threads' queues (for communication between  
> resources assigned to different queues),

Possibly complicated. Events on one queue should only be able to work
with the simulation state objects private to that queue; otherwise you
could get concurrency and causality problems.

>  how different threads should be kept in sync, 

I presume you mean how to avoid threads blocking all the time? As a
first step I think it would be safest for event queues to lock and
synchronize. To give these queues some more room you could use some
imprecise PDES (parallel discrete event simulation) techniques, e.g.
quantum-based or slack-based simulation. These will, however, cause
certain objects to be affected by events later than they should have
been, since they simulated ahead, thus introducing some causality
errors.
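As a sketch of the quantum-based variant (hypothetical names, not m5 code): all queues run freely up to a shared horizon of (global minimum event time + quantum), then barrier-synchronize, trading up to one quantum of imprecision for far fewer synchronizations.

```python
# Quantum-based synchronization sketch.  Queues are stand-ins here:
# sorted lists of event ticks rather than real event objects.

def run_quantum(queues, quantum):
    """Advance every queue to a common horizon, then stop (the barrier).
    Returns (horizon, events processed this quantum per queue)."""
    horizon = min(q[0] for q in queues.values() if q) + quantum
    processed = {}
    for name, evts in queues.items():
        done = [t for t in evts if t < horizon]  # run ahead freely
        processed[name] = done
        del evts[:len(done)]                     # consume them
    return horizon, processed

queues = {"a": [0, 3, 12], "b": [5, 9, 20]}
horizon, processed = run_quantum(queues, 10)  # horizon = 0 + 10 = 10
```

Any event a queue schedules on another inside the quantum can be seen up to one quantum late; that is exactly the bounded causality error mentioned above.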

> and the building of an interconnect that supported  
> dealing with communicating objects on different threads. 

I don't see why this would be needed if causality errors were
impossible. Please elaborate.

> Depending on the number of events that were flowing between the threads the 
> way  
> they communicate is very important. Ideally some sort of lock free  
> data structure could be used for this. 

True, but such optimization is not my primary concern. I'm trying to get
good scalability; constant-factor speedups are secondary. Those changes
would also be limited to one module, whereas I'm primarily trying to get
it right from an architectural point of view.

> One of our goals was that  
> threaded simulation and single-thread simulation should provide  
> exactly the same result in which case events between threads must be  
> scheduled on the exact same cycle in both cases.

This can be done but prior research on PDES shows that allowing some
imprecision can yield huge performance enhancements. Ideally the user
should be able to choose how much imprecision is introduced in the
simulation.

> If there is a large  
> delay (in simulated time) between the two thread domains this is not  
> too bad, however if there is a short delay it's not clear how this can  
> be done effectively yet. 
>   

I don't fully understand what you mean. If one thread had to wait a long
time for another to reach the same cycle, wouldn't that be worse than
for short delays? Maybe you are thinking of context-switch overhead on
systems where the number of simulating threads is much larger than the
number of cores. Calculating critical latencies for new events could
lengthen these delays. That would, however, still not solve the issue of
blocked simulation threads on systems with plenty of cores. That would
require threads to be able to run ahead much further, possibly requiring
some imprecision as explained before.

> Finally, we had envisioned new thread-aware  
> interconnect objects which would do the right thing to pass events  
> between threads.
>   

The way I see it, the only objects that would need to be thread-aware
are the EventQueues.

> With various hacks a summer student at Michigan had made some progress  
> on running two different systems in the same simulation process on  
> different threads, but the implementation was less than ideal.  
> However, the two systems running at the same time is a good initial  
> goal and can be used to test the sensitivity of the threads/ 
> synchronization to the size of the quantum of simulation.  
> Additionally, an ethernet link could have a reasonable latency and  
> would probably make for a good place to first try out communicating  
> between two threads (each representing a system).
>
>   
>> I've read some of the m5 version 2 documentation and code, and it  
>> seems
>> that quite some effort has already been put in facilitating  
>> asynchronous
>> communication between components (cf. memory system). Yet primarily  
>> for
>> reasons of simplification, rather than for the sake of parallelization.  
>> If
>> m5 is consistently designed in this way then parallelizing it could be
>> fairly simple. For example, it would just be a matter of  
>> implementing a
>> layer (some sort of proxy object) between ports to facilitate the
>> asynchronous communication in the memory hierarchy but the MemObjects
>> themselves could remain untouched. Furthermore since all memory  
>> systems
>> use the port interface this could be done very generically. My main
>> question is then if there are still SimObjects in m5 v2 which don't
>> communicate in such a generic event based manner? Maybe some  
>> subsystems
>> will fail when called upon asynchronously and maybe I'm even  
>> overlooking
>> some other serious issues. I also get the feeling that v2 is miles  
>> ahead
>> of v1.1 in this area.
>>     
> Our timing memory system does support asynchronous communication  
> because most real-world memory systems do. M5 v2.0 is several miles  
> ahead of v1.1. All objects in the memory system support both the  
> atomic and timing mode accesses. These inherit from MemObject which  
> inherits from SimObject. There are SimObjects that don't communicate  
> through events, however it's doubtful that you would ever want one of  
> them in a different thread. These are things like TLBs and interrupt  
> controllers which are pretty much welded to the CPU that they're  
> responsible for.
>   

This could be a problem, since a good PDES implementation would need to
know which state can be affected by which events. If no single event
could affect two SimObjects, this would be very easy.

> I think other people will probably be best to answer the rest of your  
> questions.
>
> Ali
>
Stijn
_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev
