I still had a shared-memory model in mind, which is totally wrong for Julia of course. There will be nworkers() copies of A in cache instead of one. I thus believe the problem is indeed memory-bandwidth. A quick test with a cpu-intensive example does give the expected scaling.
Thanks, Tom
