Sorry it took so long for me to reply, Ali Saidi, but I've been busy and I wanted to think things through. I think I had the wrong idea of how m5 works (as I said, I'm a newbie). I originally thought m5 worked in a more packet-centric way: one SimObject would produce request packets, and connected SimObjects would check every cycle whether they had received any requests, then forward or handle them and send replies. I thought events only drove the internal simulation of SimObjects, including the constant checking for messages. That model is somewhat more intuitive, since physical systems behave more like that, but it is also inefficient. Now I see that m5 follows the DES (discrete event simulation) execution paradigm. M5 could really use a documentation page describing its DES system!
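For other newcomers reading along: a DES engine keeps a single queue of (timestamp, action) pairs and repeatedly executes the earliest one, so nothing polls every cycle. Here is a generic sketch of that loop (illustrative only, not m5's actual implementation; all names are mine):

```python
import heapq

class EventQueue:
    """Minimal discrete event simulation (DES) loop: events are popped
    and executed in global timestamp order; no per-cycle polling."""

    def __init__(self):
        self._heap = []
        self._seq = 0       # tie-breaker for events at the same tick
        self.curtick = 0

    def schedule(self, when, action):
        assert when >= self.curtick, "cannot schedule into the past"
        heapq.heappush(self._heap, (when, self._seq, action))
        self._seq += 1

    def run(self):
        while self._heap:
            when, _, action = heapq.heappop(self._heap)
            self.curtick = when
            action()        # the handler may schedule further events

# A toy self-rescheduling event, firing every 10 ticks until tick 30:
eq = EventQueue()
log = []
def tick():
    log.append(eq.curtick)
    if eq.curtick < 30:
        eq.schedule(eq.curtick + 10, tick)
eq.schedule(0, tick)
eq.run()
# log == [0, 10, 20, 30]
```

Simulated time jumps straight from one scheduled event to the next, which is why a DES can be fast but also why parallelizing it is non-trivial: the total order of the queue is what keeps causality intact.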
The challenge then is to change from DES to a PDES (parallel DES). Implementing a conservative PDES system is challenging in itself, because efficient parallel processing of events on multiple queues requires that these queues block as little as possible. A queue can only process events up to a certain cycle when it is certain that events on other queues will not schedule any more events on it that need to be processed before that cycle; otherwise you get causality errors. Let's call this time the critical latency. Let's also say that if the processing of events on one queue can cause the scheduling of events on another, there is a link from the first to the second. Since, in general, the only way to be certain when and where events will schedule other events is to execute them, a general solution for calculating the critical latency does not exist. In other words, if you want to avoid total synchronization, the events will have to tell your queues when it is safe to proceed. This requires smart events that know which events will be scheduled in the future as a result of them; links between queues can then be changed dynamically.

Another challenge will be to enable the cross-scheduling of events while at the same time protecting objects containing simulation state from concurrent access. Furthermore, we should be able to deduce which events access which state objects, so that we can enforce that a state object is only affected by events running on one queue. This will also permit events to be scheduled on other queues, allowing for dynamic workload scheduling. In theory it is possible to share state between events on different queues, but this is much more difficult. I already have some ideas on how this can be implemented, but I'm going to do some more research before I go into specifics. If this turns out to be too much work I will not be able to do it. What is certain is that existing code will have to be changed, possibly quite extensively.
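To make the "critical latency" idea concrete, here is a sketch of the conservative rule that null-message algorithms (Chandy-Misra-Bryant) build on: a queue may safely execute events up to the minimum, over all incoming links, of the sender's clock plus that link's minimum latency. All names here are mine, not m5's:

```python
# Conservative PDES sketch (illustrative, not m5 code): each queue
# tracks its incoming "links" and may only run events up to the
# earliest tick at which another queue could still inject work for it.

class PDESQueue:
    def __init__(self, name):
        self.name = name
        self.clock = 0        # last tick this queue has safely reached
        self.in_links = []    # list of (source_queue, min_link_latency)

    def safe_until(self):
        """The 'critical latency' bound: no as-yet-unseen event from
        any incoming link can arrive before this tick."""
        if not self.in_links:
            return float("inf")
        return min(src.clock + lat for src, lat in self.in_links)

cpu0, cpu1 = PDESQueue("cpu0"), PDESQueue("cpu1")
# Suppose messages between the two queues take at least 100 ticks;
# that lookahead is what lets them run ahead of each other in parallel.
cpu0.in_links.append((cpu1, 100))
cpu1.in_links.append((cpu0, 100))

cpu1.clock = 40
assert cpu0.safe_until() == 140   # cpu1.clock (40) + 100
assert cpu1.safe_until() == 100   # cpu0.clock (0) + 100
```

Note that with a zero-latency link both bounds collapse to the neighbour's current clock, so neither queue can advance past the other: that is exactly the same-cycle deadlock risk I ask about below.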
Maybe I will be able to limit changes to the subsystems I will be needing; I don't know yet. Optimistic PDES could also be implemented, but this would require pretty much all of the work needed for conservative PDES plus detection and recovery functionality, and its effectiveness is doubtful. Are there events that can schedule new events in the same cycle in m5? If so, a PDES system can deadlock! This can be overcome generically but will require more work.

Ali Saidi wrote:
> On Oct 4, 2009, at 8:11 AM, Stijn Souffriau wrote:
>
>> Dear developers,
>>
>> I'm a senior computer science student at Ghent University starting
>> work on my master's thesis. I'm working on the simulation of many-core
>> systems together with researchers from the department of electronics
>> and information systems. Know that by many cores I mean hundreds of
>> cores. We think we will be able to simulate these cores by using two
>> techniques. One is non cycle accurate core simulation, based on
>> interval simulation, and the other is parallelization. The former is
>> pretty much taken care of; the latter is what I will be working on.
>> You could say my job is to scale the simulator up to many cores and
>> address any accuracy and synchronization problems along the way. Prior
>> work has been implemented as an interval simulating CPU in the m5-1.1
>> simulator. We plan on implementing future work in version 2 of m5. At
>> this early stage I'm faced with several questions, mostly concerning
>> the modifiability of m5, and this is why I'm writing you. I have
>> absolutely no experience in coding m5 yet, which is why I would
>> appreciate your thoughts on this matter.
>
> This is great! It's something that we've slowly been working towards.
> Nate is also probably going to chime in, but I'll tell you a bit about
> what we've been thinking.
>
>> I'm trying to figure out how generically I can parallelize m5. Ideally
>> all the components or groups of components (SimObjects) suited to run
>> in parallel will be driven by their own clock. The main challenge
>> might be to adapt components so that they can communicate
>> asynchronously. When this is taken care of we can move on to
>> implementing channels which will facilitate the asynchronous
>> communication between the components. These channels will mainly
>> protect against concurrency and serialize the requests from
>> asynchronously running SimObjects (e.g. CPU cores) to a shared
>> resource (e.g. L2 D-cache) whenever needed. When this is finished it
>> would just be a matter of assigning work to threads, scheduling them
>> and allocating memory (easy if shared, difficult if distributed) in
>> the core of m5.
>
> We've recently started down this route. The two changes that have been
> implemented so far are a new configuration system in Python that
> supports inheritance, and then using that configuration system to set
> a pointer to an event queue in which the object should schedule all
> its events. Currently all the objects schedule their own events;
> however, not all SimObjects support the new coherence system yet.

Ideally the simulator could optimize this by choosing what to run in parallel itself, and even dynamically distributing event processing across event queues. However, at first, this might be the way to go. What do you mean by the "new coherence system"?

> The three big pieces that are missing are how threads should
> cross-schedule events on other threads' queues (for communication
> between resources assigned to different queues),

Possibly complicated. Events on one queue should only be able to work with the simulation state objects private to that queue; otherwise you could get concurrency and causality problems.

> how different threads should be kept in sync,

I presume you mean how to avoid threads locking up all the time?
In a first stage I think it would be safest for event queues to lock and synchronize. To give these queues some more room you could use some imprecise PDES (parallel discrete event simulation) techniques, e.g. quantum-based simulation or slack-based simulation. These will, however, cause certain objects to be affected by events later than they should have been, since they simulated ahead, thus introducing some causality errors.

> and the building of an interconnect that supported dealing with
> communicating objects on different threads.

I don't see why this would be needed if causality errors were impossible. Please elaborate.

> Depending on the number of events that were flowing between the
> threads the way they communicate is very important. Ideally some sort
> of lock free data structure could be used for this.

True, but such optimization is not my primary concern. I'm trying to get good scalability; constant speedups are secondary. These changes would also be limited to one module, whereas I'm primarily trying to get it right from an architectural point of view.

> One of our goals was that threaded simulation and single-thread
> simulation should provide exactly the same result, in which case
> events between threads must be scheduled on the exact same cycle in
> both cases.

This can be done, but prior research on PDES shows that allowing some imprecision can yield huge performance gains. Ideally the user should be able to choose how much imprecision is introduced into the simulation.

> If there is a large delay (in simulated time) between the two thread
> domains this is not too bad, however if there is a short delay it's
> not clear how this can be done effectively yet.

I don't fully understand what you mean. If one thread had to wait a long time for another one to reach the same cycle, wouldn't this be worse than for short delays?
Maybe you are thinking of context-switch overhead on systems where the number of simulating threads is much larger than the number of cores. Calculating critical latencies for new events could lengthen these delays. This would, however, still not solve the issue of blocking simulation threads on systems where there are plenty of cores; that would require threads to be able to run ahead much further, possibly requiring some imprecision as explained before.

> Finally, we had envisioned new thread-aware interconnect objects which
> would do the right thing to pass events between threads.

The way I see it, the only objects that would need to be thread-aware are the EventQueues.

> With various hacks a summer student at Michigan had made some progress
> on running two different systems in the same simulation process on
> different threads, but the implementation was less than ideal.
> However, the two systems running at the same time is a good initial
> goal and can be used to test the sensitivity of the threads/
> synchronization to the size of the quantum of simulation.
> Additionally, an ethernet link could have a reasonable latency and
> would probably make for a good place to first try out communicating
> between two threads (each representing a system).

>> I've read some of the m5 version 2 documentation and code, and it
>> seems that quite some effort has already been put into facilitating
>> asynchronous communication between components (cf. the memory
>> system), yet primarily for reasons of simplification rather than for
>> the sake of parallelization. If m5 is consistently designed in this
>> way then parallelizing it could be fairly simple. For example, it
>> would just be a matter of implementing a layer (some sort of proxy
>> object) between ports to facilitate the asynchronous communication in
>> the memory hierarchy, but the MemObjects themselves could remain
>> untouched.
>> Furthermore, since all memory systems use the port interface, this
>> could be done very generically. My main question is then: are there
>> still SimObjects in m5 v2 which don't communicate in such a generic,
>> event-based manner? Maybe some subsystems will fail when called upon
>> asynchronously, and maybe I'm even overlooking some other serious
>> issues. I also get the feeling that v2 is miles ahead of v1.1 in this
>> area.
>
> Our timing memory system does support asynchronous communication
> because most real-world memory systems do. M5 v2.0 is several miles
> ahead of v1.1. All objects in the memory system support both the
> atomic and timing mode accesses. These inherit from MemObject, which
> inherits from SimObject. There are SimObjects that don't communicate
> through events; however, it's doubtful that you would ever want one of
> them in a different thread. These are things like TLBs and interrupt
> controllers which are pretty much welded to the CPU that they're
> responsible for.

This could be a problem, since a good PDES implementation would need to know which state can be affected by which events. If no event could affect the state of more than one SimObject, this would be very easy.

> I think other people will probably be best to answer the rest of your
> questions.
>
> Ali
>
> _______________________________________________
> m5-dev mailing list
> m5-dev@m5sim.org
> http://m5sim.org/mailman/listinfo/m5-dev

Stijn
_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev