Re: parallel cost-centre profiling
On 23/06/2010 18:44, Henrique Ferreiro wrote:
> 2010/6/15 Simon Marlow <marlo...@gmail.com>:
>> On 15/06/2010 16:28, Henrique Ferreiro wrote:
>>> I got the most important pieces working (I think). The question now
>>> is, are you interested in this?
>>
>> It all depends on whether other people would find it useful or not.
>> At the moment I'm not convinced that profiling each Capability
>> separately will give results that are useful, because the assignment
>> of work to Capabilities is quite arbitrary and will change from run
>> to run. The only way to get meaningful results would be to use
>> forkOnIO; this won't be very useful for par/pseq or Strategies.
>
> I've been thinking about this and came to the conclusion that
> profiling per capability isn't only useless but wrong, as it might
> build stacks of completely unrelated code. You said before that
> per-thread profiling is much harder. So, what would be the problem
> with saving and restoring the profiling information each time a
> thread is run?

So one option is to make a new CCS root for each thread, and that way each thread would end up creating its own tree of CCSs. That would seem to work nicely - you get per-thread stacks almost for free.

Unfortunately it's not really per-thread profiling, because if one thread happens to evaluate a thunk created by another thread, then the costs of doing so would be attributed to the thread that created the thunk (maybe that's what you want, and maybe it's consistent with the CCS view of the world, I'm not sure). This also means you still need to lock access to the CCS structures, because two threads might be accessing the same one simultaneously.

Do you really want per-thread profiling, anyway? What happens when there are thousands of threads?

> I didn't know about this. I've done some really small tests and the
> overhead of the HPC system seems lower than the one from the
> profiling system. The problem I see is that it may have to be changed
> a lot to comply with the cost-centre semantics (subsuming costs and
> the like).
Well, it would be a completely different cost semantics. Whether that's good or bad I can't say.

Cheers,
Simon

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc
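[Editor's note] The save/restore scheme Henrique proposes above, where each thread carries its own profiling state and the scheduler swaps it in and out of the Capability on every context switch, can be sketched roughly as below. This is a much-simplified illustration, not real GHC RTS code: the struct layouts and the scheduleThread/descheduleThread names are hypothetical.

```c
#include <stddef.h>

typedef struct CostCentreStack_ {
    const char *label;            /* stand-in for the real CCS payload */
} CostCentreStack;

typedef struct Capability_ {
    CostCentreStack *cccs;        /* stack of the thread running right now */
} Capability;

typedef struct StgTSO_ {
    CostCentreStack *saved_cccs;  /* this thread's own profiling state */
} StgTSO;

/* On entry: restore the thread's stack into the Capability, so ticks
 * charged from here on are attributed to this thread's tree. */
void scheduleThread(Capability *cap, StgTSO *tso)
{
    cap->cccs = tso->saved_cccs;
}

/* On descheduling: save whatever the thread pushed in the meantime. */
void descheduleThread(Capability *cap, StgTSO *tso)
{
    tso->saved_cccs = cap->cccs;
}
```

This also makes Simon's objection concrete: the swap only covers the *current* stack pointer, so a thunk built while thread A's stack was current still charges its evaluation costs into A's tree even if thread B later forces it.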
Re: parallel cost-centre profiling
2010/7/1 Simon Marlow <marlo...@gmail.com>:
> So one option is to make a new CCS root for each thread, and that way
> each thread would end up creating its own tree of CCSs. That would
> seem to work nicely - you get per-thread stacks almost for free.
>
> Unfortunately it's not really per-thread profiling, because if one
> thread happens to evaluate a thunk created by another thread then the
> costs of doing so would be attributed to the thread that created the
> thunk (maybe that's what you want, and maybe it's consistent with the
> CCS view of the world, I'm not sure). This also means you still need
> to lock access to the CCS structures, because two threads might be
> accessing the same one simultaneously.

Yes, I was thinking about the new CCS root you mention. To solve the sharing problem I was thinking about storing CCS ids instead of pointers and making each thread allocate a new CCS each time it finds a new id. I think it shouldn't be much work, as there shouldn't be that many shared CCSs.

> Do you really want per-thread profiling, anyway? What happens when
> there are thousands of threads?

Anyway, I think I can provide valuable information by doing offline analysis of the data or by hiding useless information. Also, integration with ThreadScope or similar tools would only look at the information relevant to what is being displayed, in the same way you use different zoom levels.

>> I didn't know about this. I've done some really small tests and the
>> overhead of the HPC system seems lower than the one from the
>> profiling system. The problem I see is that it may have to be
>> changed a lot to comply with the cost-centre semantics (subsuming
>> costs and the like).
>
> Well, it would be a completely different cost semantics. Whether
> that's good or bad I can't say.
>
> Cheers,
> Simon
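[Editor's note] The id-based scheme Henrique sketches, storing a small integer CCS id in shared structures rather than a pointer, so that each thread lazily allocates its own private copy on first encounter and no locking is needed, might look something like this. Everything here (names, table layout, the fixed id limit) is hypothetical.

```c
#include <stddef.h>

#define MAX_CCS_IDS 1024   /* illustrative bound on distinct CCS ids */

typedef struct LocalCCS_ {
    unsigned long time_ticks;  /* this thread's costs for this stack */
    int           in_use;      /* has this thread met the id yet?    */
} LocalCCS;

/* One private table per thread, mapping global id -> local copy. */
typedef struct ThreadProfState_ {
    LocalCCS by_id[MAX_CCS_IDS];
} ThreadProfState;

/* Called when a thread enters a CCS identified by ccs_id. No mutex is
 * required: each thread only ever touches its own table, which is the
 * point of using ids instead of shared pointers. */
LocalCCS *lookupLocalCCS(ThreadProfState *t, int ccs_id)
{
    LocalCCS *c = &t->by_id[ccs_id];
    if (!c->in_use) {          /* first encounter: "allocate" the copy */
        c->in_use = 1;
        c->time_ticks = 0;
    }
    return c;
}
```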
Re: parallel cost-centre profiling
2010/6/15 Simon Marlow <marlo...@gmail.com>:
> On 15/06/2010 16:28, Henrique Ferreiro wrote:
>> I got the most important pieces working (I think). The question now
>> is, are you interested in this?
>
> It all depends on whether other people would find it useful or not.
> At the moment I'm not convinced that profiling each Capability
> separately will give results that are useful, because the assignment
> of work to Capabilities is quite arbitrary and will change from run
> to run. The only way to get meaningful results would be to use
> forkOnIO; this won't be very useful for par/pseq or Strategies.

I've been thinking about this and came to the conclusion that profiling per capability isn't only useless but wrong, as it might build stacks of completely unrelated code. You said before that per-thread profiling is much harder. So, what would be the problem with saving and restoring the profiling information each time a thread is run?

> So the question really is: what information do you want out? I can
> see a use for just doing ordinary profiling on parallel programs,
> although the overheads of profiling may well get in the way of
> getting useful information out.

I definitely want per-thread profiling information. I also wanted to do event logging at each interval so that I get timing information.

> One thing you could do is to use HPC for profiling. The idea would be
> to record the last tick, and sample the thread/tick at each interval.
> Then you get a per-thread profile, but without the stack information
> of the cost-centre profiler. Perhaps the overhead of this might be
> too high though.

I didn't know about this. I've done some really small tests and the overhead of the HPC system seems lower than the one from the profiling system. The problem I see is that it may have to be changed a lot to comply with the cost-centre semantics (subsuming costs and the like).

> Cheers,
> Simon
Re: parallel cost-centre profiling
On 14/06/2010 08:39, Henrique Ferreiro wrote:
> Sorry for the late reply, I thought I had made it work but I am still
> fighting with it.
>
>> Do you really need to do this? Why not share the stacks and use a
>> mutex to protect the operations?
>
> My idea is to allow for profiling of each capability, so I need to
> keep the stacks independent of each other. Otherwise, we would get
> the same output as if it had been run sequentially.

I'm not sure it's useful to profile each Capability separately. Threads migrate between Capabilities under the control of the runtime system, so you won't get the same results from run to run. Perhaps what you really wanted was per-thread profiling? But that's much harder - you'd need per-thread stacks. It would be a big change to the profiling system, and I'm not really sure whether the benefit is worth it.

Cheers,
Simon
Re: parallel cost-centre profiling
On 15/06/2010 16:28, Henrique Ferreiro wrote:
> I got the most important pieces working (I think). The question now
> is, are you interested in this?

It all depends on whether other people would find it useful or not. At the moment I'm not convinced that profiling each Capability separately will give results that are useful, because the assignment of work to Capabilities is quite arbitrary and will change from run to run. The only way to get meaningful results would be to use forkOnIO; this won't be very useful for par/pseq or Strategies.

So the question really is: what information do you want out? I can see a use for just doing ordinary profiling on parallel programs, although the overheads of profiling may well get in the way of getting useful information out.

One thing you could do is to use HPC for profiling. The idea would be to record the last tick, and sample the thread/tick at each interval. Then you get a per-thread profile, but without the stack information of the cost-centre profiler. Perhaps the overhead of this might be too high though.

Cheers,
Simon
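[Editor's note] The HPC-based sampling Simon describes, where instrumented code records the id of the last tick box it crossed and a timer samples a (thread, last tick) pair per interval, could be sketched as below. The result is a flat per-thread histogram with no stack information, exactly the trade-off mentioned in the message. All names and sizes are illustrative.

```c
#define N_TICK_BOXES 128   /* illustrative number of HPC tick boxes */
#define MAX_THREADS  8     /* illustrative thread bound              */

/* Written by instrumented (mutator) code on each tick crossing. */
int last_tick[MAX_THREADS];

/* Accumulated by the timer; read offline to build the profile. */
unsigned long samples[MAX_THREADS][N_TICK_BOXES];

/* Timer handler: one sample per thread per interval. Mutator code
 * only ever writes a single int per tick, which is where the hoped-for
 * low overhead comes from. */
void sampleTicks(int n_threads)
{
    for (int t = 0; t < n_threads; t++)
        samples[t][last_tick[t]]++;
}
```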
Re: parallel cost-centre profiling
Sorry for the late reply, I thought I had made it work but I am still fighting with it.

> Do you really need to do this? Why not share the stacks and use a
> mutex to protect the operations?

My idea is to allow for profiling of each capability, so I need to keep the stacks independent of each other. Otherwise, we would get the same output as if it had been run sequentially.

> Cheers,
> Simon
Re: parallel cost-centre profiling
Hi!

I managed to make it work with one capability. That is, the CCCS is now stored per capability, but I can't enable parallel profiling because the stacks are still shared.

Now I want to have one instance of each stack and cost centre per capability, so that I can track the behaviour of each capability independently. Doing that is easy in the runtime system, but I found out that I also have to change the code generator in the same way: I need emitCostCentreDecl and costCentreStackDecl to declare an array of structs instead of a single struct. Can someone help me with this?

2010/6/4 Henrique Ferreiro <hferre...@udc.es>:
> Hi again! I need some help with this.
>
> 2010/3/17 Simon Marlow <marlo...@gmail.com>:
>> On 16/03/2010 19:34, Henrique Ferreiro wrote:
>>> Hello! I am trying to make cost centre profiling work in the
>>> threaded RTS build in order to use that information to better
>>> understand parallel behaviour. Currently I am learning about the
>>> internals of GHC and I am thinking about how this could be done.
>>> The main blocker is that the current cost-centre stack is a shared
>>> global variable. The simplest solution I came up with is to convert
>>> it to a thread-local variable. The problem would be how to access
>>> it from the global timer.
>>
>> Yes, basically what you want to do is put CCCS into the StgRegs
>> structure, which will make it thread-local. In the timer signal you
>> want to bump the counters for the CCCS on each Capability - so just
>> iterate through the array of Capabilities and bump each one.
>
> I tried adding a new register CCCS to StgRegTable in stg/Regs.h, but
> there is too much hidden knowledge in the code and it isn't working.
> I got to the point where I have changed every reference to the global
> variable to this register. The problem is that it isn't getting
> updated.
>
> Debugging a bit, it seems that the register is used as it should (the
> calls to PushCostCentre and AppendCCS from the generated code use the
> correct value in CCCS), but it isn't getting stored in the register
> table, because every time the timer is called, the value stored is
> CCS_SYSTEM, the one used in initialisation. I tried to mimic how the
> other registers are implemented, but there is no documentation at
> all, so I wasn't sure what exactly was required. Could someone tell
> me where exactly I have to change things, or should I post my changes
> and ask about specific details?
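[Editor's note] The code-generator change Henrique asks about amounts to this: instead of emitting one global struct per cost centre, emit an array of them indexed by capability number, which the report phase then merges. A hypothetical before/after of the emitted C declarations, with simplified field names and an assumed capability bound:

```c
#define MAX_N_CAPABILITIES 64   /* assumed upper bound on capabilities */

typedef struct CostCentre_ {
    const char   *label;
    const char   *module;
    unsigned long scc_count;    /* simplified: one counter per instance */
} CostCentre;

/* Before: what emitCostCentreDecl conceptually emits today, a single
 * shared instance:
 *
 *     CostCentre cc_myFunc = { "myFunc", "Main", 0 };
 *
 * After: one instance per capability; generated entry code would index
 * it with the running capability's number (zero-initialised here): */
CostCentre cc_myFunc[MAX_N_CAPABILITIES];

/* At report time the per-capability counters are summed back together. */
unsigned long totalSccCount(const CostCentre *ccs, int n_caps)
{
    unsigned long total = 0;
    for (int i = 0; i < n_caps; i++)
        total += ccs[i].scc_count;
    return total;
}
```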
Re: parallel cost-centre profiling
On 10/06/10 18:06, Henrique Ferreiro wrote:
> Hi!
>
> I managed to make it work with one capability. That is, the CCCS is
> now stored per capability, but I can't enable parallel profiling
> because the stacks are still shared.
>
> Now I want to have one instance of each stack and cost centre per
> capability, so that I can track the behaviour of each capability
> independently. Doing that is easy in the runtime system, but I found
> out that I also have to change the code generator in the same way: I
> need emitCostCentreDecl and costCentreStackDecl to declare an array
> of structs instead of a single struct. Can someone help me with this?

Do you really need to do this? Why not share the stacks and use a mutex to protect the operations?

Cheers,
Simon

> 2010/6/4 Henrique Ferreiro <hferre...@udc.es>:
>> Hi again! I need some help with this.
>>
>> 2010/3/17 Simon Marlow <marlo...@gmail.com>:
>>> On 16/03/2010 19:34, Henrique Ferreiro wrote:
>>>> Hello! I am trying to make cost centre profiling work in the
>>>> threaded RTS build in order to use that information to better
>>>> understand parallel behaviour. Currently I am learning about the
>>>> internals of GHC and I am thinking about how this could be done.
>>>> The main blocker is that the current cost-centre stack is a shared
>>>> global variable. The simplest solution I came up with is to
>>>> convert it to a thread-local variable. The problem would be how to
>>>> access it from the global timer.
>>>
>>> Yes, basically what you want to do is put CCCS into the StgRegs
>>> structure, which will make it thread-local. In the timer signal you
>>> want to bump the counters for the CCCS on each Capability - so just
>>> iterate through the array of Capabilities and bump each one.
>>
>> I tried adding a new register CCCS to StgRegTable in stg/Regs.h, but
>> there is too much hidden knowledge in the code and it isn't working.
>> I got to the point where I have changed every reference to the
>> global variable to this register. The problem is that it isn't
>> getting updated.
>>
>> Debugging a bit, it seems that the register is used as it should
>> (the calls to PushCostCentre and AppendCCS from the generated code
>> use the correct value in CCCS), but it isn't getting stored in the
>> register table, because every time the timer is called, the value
>> stored is CCS_SYSTEM, the one used in initialisation. I tried to
>> mimic how the other registers are implemented, but there is no
>> documentation at all, so I wasn't sure what exactly was required.
>> Could someone tell me where exactly I have to change things, or
>> should I post my changes and ask about specific details?
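[Editor's note] Simon's alternative, keeping one shared set of cost-centre stacks and serialising the mutating operations with a mutex, would sit roughly as follows. The real PushCostCentre lives in the RTS; this cut-down version only shows where the lock would go, and every name is illustrative.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct CostCentreStack_ {
    const char              *cc_name;
    struct CostCentreStack_ *prevStack;   /* link to the enclosing stack */
} CostCentreStack;

/* One global lock "for a start", as suggested elsewhere in the thread;
 * finer-grained locking could come later if contention shows up. */
static pthread_mutex_t ccs_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Several Capabilities may try to extend the same shared stack at
 * once, so the allocate-and-link step is serialised by the mutex. */
CostCentreStack *pushCostCentre(CostCentreStack *ccs, const char *cc)
{
    pthread_mutex_lock(&ccs_mutex);
    CostCentreStack *new_ccs = malloc(sizeof *new_ccs);
    new_ccs->cc_name   = cc;
    new_ccs->prevStack = ccs;
    pthread_mutex_unlock(&ccs_mutex);
    return new_ccs;
}
```

The obvious cost, which the thread comes back to later, is that lock contention inside the profiler can distort the very parallel behaviour being measured.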
Re: parallel cost-centre profiling
Hi again! I need some help with this.

2010/3/17 Simon Marlow <marlo...@gmail.com>:
> On 16/03/2010 19:34, Henrique Ferreiro wrote:
>> Hello! I am trying to make cost centre profiling work in the
>> threaded RTS build in order to use that information to better
>> understand parallel behaviour. Currently I am learning about the
>> internals of GHC and I am thinking about how this could be done. The
>> main blocker is that the current cost-centre stack is a shared
>> global variable. The simplest solution I came up with is to convert
>> it to a thread-local variable. The problem would be how to access it
>> from the global timer.
>
> Yes, basically what you want to do is put CCCS into the StgRegs
> structure, which will make it thread-local. In the timer signal you
> want to bump the counters for the CCCS on each Capability - so just
> iterate through the array of Capabilities and bump each one.

I tried adding a new register CCCS to StgRegTable in stg/Regs.h, but there is too much hidden knowledge in the code and it isn't working. I got to the point where I have changed every reference to the global variable to this register. The problem is that it isn't getting updated.

Debugging a bit, it seems that the register is used as it should (the calls to PushCostCentre and AppendCCS from the generated code use the correct value in CCCS), but it isn't getting stored in the register table, because every time the timer is called, the value stored is CCS_SYSTEM, the one used in initialisation. I tried to mimic how the other registers are implemented, but there is no documentation at all, so I wasn't sure what exactly was required. Could someone tell me where exactly I have to change things, or should I post my changes and ask about specific details?
Re: parallel cost-centre profiling
On 16/03/2010 19:34, Henrique Ferreiro wrote:
> Hello! I am trying to make cost centre profiling work in the threaded
> RTS build in order to use that information to better understand
> parallel behaviour. Currently I am learning about the internals of
> GHC and I am thinking about how this could be done. The main blocker
> is that the current cost-centre stack is a shared global variable.
> The simplest solution I came up with is to convert it to a
> thread-local variable. The problem would be how to access it from the
> global timer.

Yes, basically what you want to do is put CCCS into the StgRegs structure, which will make it thread-local. In the timer signal you want to bump the counters for the CCCS on each Capability - so just iterate through the array of Capabilities and bump each one.

I'm sure this isn't all that needs to be done, though. The cost-centre stack data structure probably needs some locks; perhaps one global lock will do for a start. The danger here is that contention in the profiling subsystem will obscure the real profiling results you were trying to obtain.

Keep us posted!

Cheers,
Simon

> As I don't have the full picture yet, I would greatly appreciate it
> if some of you could give me some advice on how to tackle this
> problem. Would it be possible to use thread-local storage for this?
> Is there a better design?
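[Editor's note] Simon's suggestion, moving CCCS into per-Capability state and having the timer signal walk the Capability array bumping whichever stack each one is currently running, can be sketched as below. This is a much-simplified stand-in for the real RTS: the struct layouts, the fixed capability count, and the handleProfTick name are all illustrative.

```c
#include <stddef.h>

#define N_CAPABILITIES 4   /* assumed fixed for the sketch */

typedef struct CostCentreStack_ {
    const char   *label;
    unsigned long time_ticks;   /* bumped once per timer signal */
} CostCentreStack;

/* Per-capability register state: CCCS lives here instead of in one
 * shared global, so each Capability tracks its own current stack. */
typedef struct Capability_ {
    CostCentreStack *cccs;
} Capability;

Capability capabilities[N_CAPABILITIES];

/* What the timer signal handler would do: iterate through the array
 * of Capabilities and charge one tick to each current stack. */
void handleProfTick(void)
{
    for (int i = 0; i < N_CAPABILITIES; i++) {
        if (capabilities[i].cccs != NULL)
            capabilities[i].cccs->time_ticks++;
    }
}
```

Note that when two Capabilities happen to be running the same stack, as in the test below, both ticks land on the same structure, which is exactly why the follow-up about locking the shared CCS data matters.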