That's great. Thanks for creating this great piece of software!

Amin
On Wed, Mar 25, 2020 at 5:56 PM Matthew Knepley <knep...@gmail.com> wrote:

> On Wed, Mar 25, 2020 at 5:41 PM Amin Sadeghi <aminthefr...@gmail.com> wrote:
>
>> Junchao, thank you for doing the experiment. I guess TACC Frontera nodes
>> have higher memory bandwidth than Compute Canada's Graham (maybe a more
>> modern CPU architecture, although I'm not familiar with which hardware
>> affects memory bandwidth).
>>
>> Mark, I did as you suggested. As you suspected, running make streams
>> confirmed that the memory bandwidth saturates at around 8 MPI processes.
>> I ran the experiment on multiple nodes but only requested 8 cores per
>> node, and here is the result:
>>
>> 1 node (8 cores total): 17.5s, 6X speedup
>> 2 nodes (16 cores total): 13.5s, 7X speedup
>> 3 nodes (24 cores total): 9.4s, 10X speedup
>> 4 nodes (32 cores total): 8.3s, 12X speedup
>> 5 nodes (40 cores total): 7.0s, 14X speedup
>> 6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
>> 7 nodes (56 cores total): 4.3s, 23X speedup
>> 8 nodes (64 cores total): 3.7s, 27X speedup
>>
>> *Note:* as you can see, the experiment with 6 nodes showed extremely poor
>> scaling, which I guess was an outlier, maybe due to some connection
>> problem?
>>
>> I also ran another experiment, requesting 2 full nodes, i.e. 64 cores,
>> and here's the result:
>>
>> 2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]
>>
>> So, it turns out that given a fixed number of cores (64 in our case),
>> much better speedups (27X vs. 16X) can be achieved if the cores are
>> spread across separate nodes.
>>
>> Anyway, I really appreciate all your input.
>>
>> *One final question:* From what I understand from Mark's comment, PETSc
>> at the moment is blind to the memory hierarchy. Is it feasible to make
>> PETSc aware of inter- and intra-node communication so that partitioning
>> is done to maximize performance? Or, to put it differently, is this
>> something that PETSc devs have their eyes on for the future?
>
> There is already stuff in VecScatter that knows about the memory
> hierarchy, which Junchao put in. We are actively working on some other
> node-aware algorithms.
>
> Thanks,
>
>    Matt
>
>> Sincerely,
>> Amin
>>
>> On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>>> I repeated your experiment on one node of TACC Frontera:
>>>
>>> 1 rank: 85.0s
>>> 16 ranks: 8.2s, 10X speedup
>>> 32 ranks: 5.7s, 15X speedup
>>>
>>> --Junchao Zhang
>>>
>>> On Wed, Mar 25, 2020 at 1:18 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>> Also, a better test is to see where streams pretty much saturates, then
>>>> run that many processes per node and repeat the test while increasing
>>>> the number of nodes. This will tell you how well your network
>>>> communication is doing.
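>>>>
>>>> On a SLURM machine like Graham, that sweep might look something like
>>>> this (the launcher and scheduler flags here are illustrative, not taken
>>>> from your logs):
>>>>
>>>>   make streams NPMAX=32                # watch where the rate flattens out
>>>>   # hold ranks-per-node at the saturation point and add nodes:
>>>>   srun --nodes=1 --ntasks-per-node=8 ./ex45 -log_view
>>>>   srun --nodes=2 --ntasks-per-node=8 ./ex45 -log_view
>>>>   srun --nodes=4 --ntasks-per-node=8 ./ex45 -log_view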
>>>>
>>>> But this result has a lot of stuff in "network communication" that can
>>>> be further evaluated. The worst thing about this, I would think, is
>>>> that the partitioning is blind to the memory hierarchy of inter- and
>>>> intra-node communication. The next thing to do is run with an initial
>>>> grid that puts one cell per node, then do uniform refinement until you
>>>> have one cell per process (e.g., one refinement step using 8 processes
>>>> per node), partition to get one cell per process, then do uniform
>>>> refinement to get a reasonably sized local problem. Alas, this is not
>>>> easy to do, but it is doable.
>>>>
>>>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>>> I would guess that you are saturating the memory bandwidth. After you
>>>>> make PETSc (make all) it will suggest that you test it (make test) and
>>>>> suggest that you run streams (make streams).
>>>>>
>>>>> I see Matt answered, but let me add that when you make streams you
>>>>> will see the memory rate for 1, 2, 3, ..., NP processes. If your
>>>>> machine is decent you should see very good speedup at the beginning,
>>>>> and then it will start to saturate. You are seeing about 50% of
>>>>> perfect speedup at 16 processes. I would expect that you will see
>>>>> something similar with streams. Without knowing your machine, your
>>>>> results look typical.
>>>>>
>>>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefr...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I ran KSP example 45 on a single node with 32 cores and 125GB of
>>>>>> memory, using 1, 16 and 32 MPI processes. Here's a comparison of the
>>>>>> time spent during KSP.solve:
>>>>>>
>>>>>> - 1 MPI process: ~98 sec, speedup: 1X
>>>>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>>>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>>>>
>>>>>> Since the problem size is large enough (8M unknowns), I expected a
>>>>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes,
>>>>>> how can it be improved?
>>>>>>
>>>>>> I've attached three log files for more details.
>>>>>>
>>>>>> Sincerely,
>>>>>> Amin
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
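(For anyone reproducing the numbers above: the benchmark is KSP tutorial
ex45, a 3D Laplacian on a DMDA grid. A sketch of the kind of invocation
involved, with the grid options being an assumption rather than the exact
ones from the attached logs:

  # vary -n over 1, 16, 32 on a single node; the -da_refine level is illustrative
  mpiexec -n 32 ./ex45 -da_refine 5 -log_view

-log_view is what produces the performance summaries that were attached.)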