Does your machine have multiple NUMA domains (more than one CPU socket)? If so, parallel initialization could be giving you bad memory affinity for your serial for-loops.
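To make the first-touch issue concrete: on a NUMA system, pages tend to be allocated near whichever task first writes them, so the style of initialization determines which NUMA domain later loops read from. A minimal sketch of the two styles (the names `n`, `D`, `a`, and `b` are made up for illustration):

```chapel
config const n = 100_000_000;
const D = {1..n};

// Parallel first-touch: default array initialization runs in parallel,
// so pages get spread across NUMA domains -- good when the loops that
// follow are also parallel (e.g. forall), as in STREAM.
var a: [D] real;

// Serial first-touch: a serial loop expression touches every page from
// one task, so pages land in that task's NUMA domain -- good when the
// loops that follow are serial for-loops.
var b: [D] real = for i in D do 0.0;
```

Which one wins depends entirely on how the rest of the program accesses the arrays, which is why the default changed for parallel codes but can hurt serial ones.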
Switching to parallel initialization resulted in a ~2X speedup for our STREAM benchmark because it meant our memory first-touch now matched how subsequent parallel loops were accessing memory. You could be seeing the opposite, where serial first-touch is really what you want, but parallel initialization is resulting in bad affinity for your for-loops. http://chapel.cray.com/releaseNotes/1.12/05-Optimizations.pdf (slides 15-20) has some more info on our decision to switch to parallel array initialization by default, and the performance impact it had on STREAM.

To minimize mailing list noise, feel free to send the video off-list, and once we figure out what's going on we can send a summary for those who might be interested.

Elliot

> I actually don't think there's any problem with parallel initialization. It
> could well be happening, but shouldn't be causing a dramatic slowdown.
>
> I would like to send you a short video clip tomorrow to illustrate what I am
> seeing on my machine. It is possible there is something machine-dependent? We
> are using a new Skylake machine.
>
> -- Dave W (sent from my phone, so please excuse brevity, speak-o's, and
> swype-o's)
>
> On July 12, 2016 6:41:37 PM EDT, Elliot Ronaghan <[email protected]> wrote:
>
>>> It looks like the
>>>
>>>   -sparallelInitElts=false
>>>
>>> setting restored our performance to what we got with 1.11. Surprisingly (to
>>> me), without this flag, the code seems to be using multicore execution of
>>> the loops I've written, as well as perhaps the array initialization (at
>>> least, that's what it looks like when I watch htop as I run it, and all 4
>>> cores run at 100% right up until it quits 12 seconds after starting). That
>>> would probably explain the terrible performance, as the untiled loop nest
>>> would probably cause terrible contention for cache lines if run
>>> concurrently.
>>
>> That flag only impacts array initialization.
>> Your for-loops will still run
>> serially (Chapel, very intentionally, does not auto-parallelize anything.)
>> When I run, I see a spike for all cores during array initialization, then
>> just one core busy for the rest of the program.
>
>>> I'm attaching our code, but my question may now just be: how do I prevent
>>> concurrent execution (and the answer may be, with the flag above).
>>
>> For now, I'd just use -sparallelInitElts=false. Currently there's no way to
>> squash the default array initialization, but we're working on that. In the
>> future you should be able to get serial array init by doing something like:
>>
>>   // replace default init with manual serial init
>>   var m: [MatrixD] int = for i in MatrixD do 0;
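Putting the two workarounds discussed in this thread together, here is a minimal self-contained sketch (the file name, `n`, and `MatrixD` are illustrative, not from the attached code):

```chapel
// serialinit.chpl -- hypothetical example
config const n = 1000;
const MatrixD = {1..n, 1..n};

// Replace the default (parallel) element initialization with a manual
// serial loop expression, so first-touch happens from a single task.
var m: [MatrixD] int = for idx in MatrixD do 0;

writeln(m[1, 1]);
```

Alternatively, leave the declaration as a plain `var m: [MatrixD] int;` and set the config param at compile time, which (assuming the flag works as described above) squashes parallel element initialization program-wide:

  chpl -sparallelInitElts=false serialinit.chpl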
