Hi Lei,

Depending on your machine and MPI library, you may need to use explicit process-to-core/socket bindings to achieve better speedup. Instructions can be found here:

http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
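For example, with Open MPI the bindings can be requested directly on the launcher command line. A minimal sketch, assuming 20 processes and a placeholder executable name (MPICH-style launchers use a different syntax, e.g. -bind-to core, so check the FAQ above and your MPI documentation):

  # Open MPI syntax; ./your_cfd_solver is a placeholder executable name
  mpiexec -n 20 --map-by socket --bind-to core ./your_cfd_solver

With the bindings in place, re-running the streams benchmark should show how much of the node's memory bandwidth, and hence the maximum speedup you can expect, is actually reachable.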
Justin

On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <[email protected]> wrote:

> Hi Matt,
>
> Thanks for your suggestions. Here is the output from the STREAM test on one
> node, which has 20 cores; I ran it with up to 20 processes. Attached is the
> dumped output with your suggested options. I really appreciate your help!
>
> MPI processes   Copy (MB/s)   Scale (MB/s)     Add (MB/s)   Triad (MB/s)
>       1         13816.9372     8020.1809     12762.3830     11852.5016
>       2         22748.7681    14081.4906     18998.4516     18303.2494
>       3         34045.2510    23410.9767     30320.2702     30163.7977
>       4         36875.5349    29440.1694     36971.1860     37377.0103
>       5         32272.8763    30316.3435     38022.0193     38815.4830
>       6         35619.8925    34457.5078     41419.3722     35825.3621
>       7         55284.2420    47706.8009     59076.4735     61680.5559
>       8         44525.8901    48949.9599     57437.7784     56671.0593
>       9         34375.7364    29507.5293     45405.3120     39518.7559
>      10         34278.0415    41721.7843     46642.2465     45454.7000
>      11         38093.7244    35147.2412     45047.0853     44983.2013
>      12         39750.8760    52038.0631     55552.9503     54884.3839
>      13         60839.0248    74143.7458     85545.3135     85667.6551
>      14         37766.2343    40279.1928     49992.8572     50303.4809
>      15         49762.3670    59077.8251     60407.9651     61691.9456
>      16         31996.7169    36962.4860     40183.5060     41096.0512
>      17         36348.3839    39108.6761     46853.4476     47266.1778
>      18         40438.7558    43195.5785     53063.4321     53605.0293
>      19         30739.4908    34280.8118     40710.5155     43330.9503
>      20         37488.3777    41791.8999     49518.9604     48908.2677
>
> ------------------------------------------------
>  np   speedup
>   1    1.0
>   2    1.54
>   3    2.54
>   4    3.15
>   5    3.27
>   6    3.02
>   7    5.2
>   8    4.78
>   9    3.33
>  10    3.84
>  11    3.8
>  12    4.63
>  13    7.23
>  14    4.24
>  15    5.2
>  16    3.47
>  17    3.99
>  18    4.52
>  19    3.66
>  20    4.13
>
> Sincerely Yours,
>
> Lei Shi
> ---------
>
> On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <[email protected]> wrote:
>
>> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <[email protected]> wrote:
>>
>>> Hello,
>>
>> 1) In order to understand this, we have to disentangle the various effects.
>> First, run the STREAMS benchmark:
>>
>>   make NPMAX=4 streams
>>
>> This will tell you the maximum speedup you can expect on this machine.
>>
>> 2) For these test cases, also send the output of
>>
>>   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
>>
>> [a combined command line is sketched after the quoted thread below]
>>
>> Thanks,
>>
>>    Matt
>>
>>> I'm trying to improve the parallel efficiency of the GMRES solve in my CFD
>>> solver, where PETSc's GMRES is used to solve the linear system generated by
>>> Newton's method. To test its efficiency, I started with a very simple
>>> inviscid subsonic 3D flow as the first test case. The parallel efficiency
>>> of the GMRES solve with ASM as the preconditioner is very bad. The results
>>> are from our latest cluster. Right now, I'm only looking at the wall-clock
>>> time of the KSP solve.
>>>
>>> 1. First, I tested ASM with GMRES and ILU(0) on each subdomain; the CPU
>>>    time on 2 cores is almost the same as for the serial run. Here are the
>>>    options for this case:
>>>
>>>      -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>      -ksp_gmres_restart 30 -ksp_pc_side right
>>>      -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
>>>      -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
>>>      -sub_pc_factor_fill 1.9
>>>
>>>    The iteration counts increase a lot for the parallel runs.
>>>
>>>    cores  iterations    err      PETSc solve wall-clock time  speedup  efficiency
>>>      1        2       1.15E-04             11.95                1
>>>      2        5       2.05E-02             10.5                 1.01       0.50
>>>      4        6       2.19E-02              7.64                1.39       0.34
>>>
>>> 2. Then I tested ASM with ILU(0) as the preconditioner only; the CPU time
>>>    on 2 cores is better than in the first test, but the speedup is still
>>>    very bad. Here are the options I'm using:
>>>
>>>      -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>      -ksp_gmres_restart 30 -ksp_pc_side right
>>>      -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
>>>      -sub_pc_factor_fill 1.9
>>>
>>>    cores  iterations    err      PETSc solve CPU time  speedup  efficiency
>>>      1       10       4.54E-04         10.68              1
>>>      2       11       9.55E-04          8.2               1.30       0.65
>>>      4       12       3.59E-04          5.26              2.03       0.50
>>>
>>> Those results are from a third-order DG scheme on a very coarse 3D mesh
>>> (480 elements). I believe I should get some speedup for this test even on
>>> this coarse mesh.
>>>
>>> My question is: why does ASM with a local subdomain solve take so much
>>> longer than ASM as a preconditioner only? The accuracy is also much worse.
>>> I have tried increasing the ASM overlap to 2, but that makes it even worse.
>>>
>>> If I use a larger mesh (~4000 elements), the second case (ASM as the
>>> preconditioner only) gives me a better speedup, but it is still not very
>>> good.
>>>
>>>    cores  iterations    err      PETSc solve CPU time  speedup  efficiency
>>>      1        7       1.91E-02         97.32              1
>>>      2        7       2.07E-02         64.94              1.5        0.74
>>>      4        7       2.61E-02         36.97              2.6        0.65
>>>
>>> Attached are the -log_summary outputs dumped from PETSc; any suggestions
>>> are welcome. I really appreciate it.
>>>
>>> Sincerely Yours,
>>>
>>> Lei Shi
>>> ---------
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
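For reference, the diagnostic options Matt asks for can simply be appended to the existing run. Below is a minimal sketch, assuming an mpiexec-style launcher and a placeholder executable name, using the second (preconditioner-only) option set from the thread together with -log_summary for the attached profiles:

  # placeholder launcher and executable; options are the ASM/ILU(0)
  # preconditioner-only set from the thread plus the requested diagnostics
  mpiexec -n 4 ./your_cfd_solver \
    -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50 \
    -ksp_gmres_restart 30 -ksp_pc_side right \
    -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0 -sub_pc_factor_fill 1.9 \
    -ksp_view -ksp_converged_reason -ksp_monitor_true_residual -log_summary

The same three monitoring options can be appended unchanged to the first option set (ASM with subdomain GMRES/ILU); they only report information and do not change the algorithm, although -ksp_monitor_true_residual adds some overhead per iteration.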
