Hi Matt,

Thanks for your suggestions. Here is the output from the STREAM test on one node, which has 20 cores; I ran it with up to 20 MPI processes. Attached are the dumped outputs with your suggested options. I really appreciate your help!
STREAM rates (MB/s) per number of MPI processes:

np        Copy        Scale          Add        Triad
 1   13816.9372    8020.1809   12762.3830   11852.5016
 2   22748.7681   14081.4906   18998.4516   18303.2494
 3   34045.2510   23410.9767   30320.2702   30163.7977
 4   36875.5349   29440.1694   36971.1860   37377.0103
 5   32272.8763   30316.3435   38022.0193   38815.4830
 6   35619.8925   34457.5078   41419.3722   35825.3621
 7   55284.2420   47706.8009   59076.4735   61680.5559
 8   44525.8901   48949.9599   57437.7784   56671.0593
 9   34375.7364   29507.5293   45405.3120   39518.7559
10   34278.0415   41721.7843   46642.2465   45454.7000
11   38093.7244   35147.2412   45047.0853   44983.2013
12   39750.8760   52038.0631   55552.9503   54884.3839
13   60839.0248   74143.7458   85545.3135   85667.6551
14   37766.2343   40279.1928   49992.8572   50303.4809
15   49762.3670   59077.8251   60407.9651   61691.9456
16   31996.7169   36962.4860   40183.5060   41096.0512
17   36348.3839   39108.6761   46853.4476   47266.1778
18   40438.7558   43195.5785   53063.4321   53605.0293
19   30739.4908   34280.8118   40710.5155   43330.9503
20   37488.3777   41791.8999   49518.9604   48908.2677

------------------------------------------------
np  speedup
 1  1.0
 2  1.54
 3  2.54
 4  3.15
 5  3.27
 6  3.02
 7  5.2
 8  4.78
 9  3.33
10  3.84
11  3.8
12  4.63
13  7.23
14  4.24
15  5.2
16  3.47
17  3.99
18  4.52
19  3.66
20  4.13

Sincerely Yours,

Lei Shi
---------

On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <[email protected]> wrote:

> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <[email protected]> wrote:
>
>> Hello,
>>
>
> 1) In order to understand this, we have to disentangle the various
> effects. First, run the STREAMS benchmark
>
>   make NPMAX=4 streams
>
> This will tell you the maximum speedup you can expect on this machine.
>
> 2) For these test cases, also send the output of
>
>   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
>
> Thanks,
>
>   Matt
>
>> I'm trying to improve the parallel efficiency of the GMRES solve in my
>> CFD solver. In my solver, PETSc's GMRES is used to solve the linear
>> system generated by Newton's method. To test its efficiency, I started
>> with a very simple inviscid subsonic 3D flow as the first test case.
>> The parallel efficiency of the GMRES solve with ASM as the
>> preconditioner is very bad. The results are from our latest cluster.
>> Right now, I'm only looking at the wclock time of the ksp_solve.
>>
>> 1. First I tested ASM with GMRES and ILU(0) on the subdomains; the CPU
>> time on 2 cores is almost the same as for the serial run. Here are the
>> options for this case:
>>
>> -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>> -ksp_gmres_restart 30 -ksp_pc_side right
>> -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
>> -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
>> -sub_pc_factor_fill 1.9
>>
>> The iteration counts increase a lot for the parallel runs.
>>
>> cores  iterations    err      petsc solve wclock time  speedup  efficiency
>>   1        2       1.15E-04          11.95               1
>>   2        5       2.05E-02          10.5                1.01      0.50
>>   4        6       2.19E-02           7.64               1.39      0.34
>>
>> 2. Then I tested ASM with ILU(0) as the preconditioner only; the CPU
>> time on 2 cores is better than in the first test, but the speedup is
>> still very bad. Here are the options I'm using:
>>
>> -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>> -ksp_gmres_restart 30 -ksp_pc_side right
>> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
>> -sub_pc_factor_fill 1.9
>>
>> cores  iterations    err      petsc solve cpu time  speedup  efficiency
>>   1       10       4.54E-04         10.68              1
>>   2       11       9.55E-04          8.2                1.30      0.65
>>   4       12       3.59E-04          5.26               2.03      0.50
>>
>> Those results are from a third-order "DG" scheme on a very coarse 3D
>> mesh (480 elements). I believe I should get some speedup for this test
>> even on this coarse mesh.
>>
>> My question is: why does ASM with a local solve take much longer than
>> ASM as a preconditioner only? Also, the accuracy is very bad. I have
>> tested changing the overlap of ASM to 2, but that makes it even worse.
>>
>> If I use a larger mesh (~4000 elements), the 2nd case with ASM as the
>> preconditioner gives me a better speedup, but it is still not very good.
>>
>> cores  iterations    err      petsc solve cpu time  speedup  efficiency
>>   1        7       1.91E-02         97.32              1
>>   2        7       2.07E-02         64.94              1.5       0.74
>>   4        7       2.61E-02         36.97              2.6       0.65
>>
>> Attached are the log_summary outputs dumped from PETSc; any suggestions
>> are welcome. I really appreciate it.
>>
>> Sincerely Yours,
>>
>> Lei Shi
>> ---------
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
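For reference, the speedup column in the STREAM table at the top appears to be just the Triad rate at np processes divided by the single-process Triad rate (e.g. 18303.2494 / 11852.5016 ~ 1.54). A minimal sketch of that calculation, with the Triad rates hard-coded from the run above:

#include <stdio.h>

/* Sketch only: reproduces the "np speedup" column above assuming
 * speedup(np) = Triad(np) / Triad(1). The rates are copied from the
 * STREAM output at the top of this mail. */
int main(void)
{
  const double triad[] = {
    11852.5016, 18303.2494, 30163.7977, 37377.0103, 38815.4830,
    35825.3621, 61680.5559, 56671.0593, 39518.7559, 45454.7000,
    44983.2013, 54884.3839, 85667.6551, 50303.4809, 61691.9456,
    41096.0512, 47266.1778, 53605.0293, 43330.9503, 48908.2677
  };
  const int n = (int)(sizeof(triad) / sizeof(triad[0]));

  for (int np = 1; np <= n; np++)
    printf("%2d  %.2f\n", np, triad[np - 1] / triad[0]);
  return 0;
}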
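To show where the solver options quoted above actually take effect, here is a minimal options-driven KSP sketch (not my real CFD driver: the matrix below is just a placeholder diagonal system rather than the Newton Jacobian). All of the -ksp_*, -pc_*, and -sub_* settings are consumed by KSPSetFromOptions(), so they can be changed from the command line without recompiling:

#include <petscksp.h>

int main(int argc,char **argv)
{
  Mat            A;
  Vec            x,b;
  KSP            ksp;
  PetscInt       i,Istart,Iend,n = 100;
  PetscScalar    v = 2.0;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;

  /* Placeholder diagonal system so the sketch is self-contained; in the
     real solver A and b come from the Newton linearization of the DG
     discretization. */
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
  for (i=Istart; i<Iend; i++) {
    ierr = MatSetValues(A,1,&i,1,&i,&v,INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatCreateVecs(A,&x,&b);CHKERRQ(ierr);
  ierr = VecSet(b,1.0);CHKERRQ(ierr);

  /* The GMRES/ASM/ILU choices are not hard-coded: KSPSetFromOptions()
     reads -ksp_type, -pc_type, -sub_ksp_type, -sub_pc_type, the monitor
     options, etc. from the command line. */
  ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

It would be run with the same kind of command line as the tests above, e.g.

  mpiexec -n 2 ./driver -ksp_type gmres -ksp_pc_side right -pc_type asm \
    -sub_pc_type ilu -sub_pc_factor_levels 0 \
    -ksp_view -ksp_converged_reason -ksp_monitor_true_residual

where "driver" is just whatever the sketch is compiled to.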
Attachments:
proc2_asm_sub_ksp.dat
proc2_asm_pconly.dat
proc1_asm_sub_ksp.dat
proc1_asm_pconly.dat
