From: Paul E. McKenney
Sent: 08/24/13 11:03 PM

> On Sat, Aug 24, 2013 at 09:59:45PM +0200, Tibor Billes wrote:
> > From: Paul E. McKenney Sent: 08/22/13 12:09 AM
> > > On Wed, Aug 21, 2013 at 11:05:51PM +0200, Tibor Billes wrote:
> > > > From: Paul E. McKenney Sent: 08/21/13 09:12 PM
> > > > > On Wed, Aug 21, 2013 at 08:14:46PM +0200, Tibor Billes wrote:
> > > > > > From: Paul E. McKenney Sent: 08/20/13 11:43 PM
> > > > > > > On Tue, Aug 20, 2013 at 10:52:26PM +0200, Tibor Billes wrote:
> > > > > > > > From: Paul E. McKenney Sent: 08/20/13 04:53 PM
> > > > > > > > > On Tue, Aug 20, 2013 at 08:01:28AM +0200, Tibor Billes wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I was using the 3.9.7 stable release and tried to upgrade
> > > > > > > > > > to the 3.10.x series. The 3.10.x series was showing
> > > > > > > > > > unusually high (>75%) system CPU usage in some
> > > > > > > > > > situations, making things really slow. The latest stable
> > > > > > > > > > I tried is 3.10.7. I also tried 3.11-rc5, they both show
> > > > > > > > > > this behaviour. This behaviour doesn't show up when the
> > > > > > > > > > system is idling, only when doing some CPU intensive
> > > > > > > > > > work, like compiling with multiple threads. Compiling
> > > > > > > > > > with only one thread seems not to trigger this behaviour.
> > > > > > > > > >
> > > > > > > > > > To be more precise I did a `perf record -a` while
> > > > > > > > > > compiling a large C++ program with scons using 4 threads,
> > > > > > > > > > the result is appended at the end of this email.
> > > > > > > > >
> > > > > > > > > New one on me!  You are running a mainstream system
> > > > > > > > > (x86_64), so I am surprised no one else noticed.
> > > > > > > > >
> > > > > > > > > Could you please send along your .config file?
> > > > > > > >
> > > > > > > > Here it is
> > > > > > >
> > > > > > > Interesting.  I don't see RCU stuff all that high on the list,
> > > > > > > but the items I do see lead me to suspect RCU_FAST_NO_HZ,
> > > > > > > which has some relevance to the otherwise inexplicable group
> > > > > > > of commits you located with your bisection.  Could you please
> > > > > > > rerun with CONFIG_RCU_FAST_NO_HZ=n?
> > > > > > >
> > > > > > > If that helps, there are some things I could try.
> > > > > >
> > > > > > It did help. I didn't notice anything unusual when running with
> > > > > > CONFIG_RCU_FAST_NO_HZ=n.
> > > > >
> > > > > Interesting.  Thank you for trying this -- and we at least have a
> > > > > short-term workaround for this problem.  I will put a patch
> > > > > together for further investigation.
> > > >
> > > > I don't specifically need this config option so I'm fine without it
> > > > in the long term, but I guess it's not supposed to behave like that.
> > >
> > > OK, good, we have a long-term workaround for your specific case, even
> > > better.  ;-)
> > >
> > > But yes, there are situations where RCU_FAST_NO_HZ needs to work a
> > > bit better.  I hope you will bear with me with a bit more testing...
> > >
> > > > > In the meantime, could you please tell me how you were measuring
> > > > > performance for your kernel builds?  Wall-clock time required to
> > > > > complete one build?  Number of builds completed per unit time?
> > > > > Something else?
> > > >
> > > > Actually, I wasn't all this sophisticated. I have a system monitor
> > > > applet on my top panel (using MATE, Linux Mint), four little
> > > > graphs, one of which shows CPU usage. Different colors indicate
> > > > different kinds of CPU usage. Blue shows user space usage, red
> > > > shows system usage, and two more for nice and iowait.
> > > > During a normal compile it's almost completely filled with blue
> > > > user space CPU usage, only the top few pixels show some iowait and
> > > > system usage. With CONFIG_RCU_FAST_NO_HZ set, about 3/4 of the
> > > > graph was red system CPU usage, the rest was blue. So I just
> > > > looked for a pile of red on my graphs when I tested different
> > > > kernel builds. But compile speed was also horrible, I couldn't
> > > > wait for the build to finish. Even the UI got unresponsive.
> > >
> > > We have been having problems with CPU accounting, but this one looks
> > > quite real.
> > >
> > > > Now I did some measuring. In the normal case a compile finished in
> > > > 36 seconds, compiled 315 object files. Here are some output lines
> > > > from dstat -tasm --vm during compile:
> > > > ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
> > > >      time     |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw | used  free| used  buff  cach  free|majpf minpf alloc  free
> > > > 21-08 21:48:05| 91   8   2   0   0   0|   0  5852k|   0     0 |   0     0 |1413  1772 |   0  7934M| 581M 58.0M  602M 6553M|   0    71k   46k   54k
> > > > 21-08 21:48:06| 93   6   1   0   0   0|   0  2064k| 137B  131B|   0     0 |1356  1650 |   0  7934M| 649M 58.0M  604M 6483M|   0    72k   47k   28k
> > > > 21-08 21:48:07| 86  11   4   0   0   0|   0  5872k|   0     0 |   0     0 |2000  2991 |   0  7934M| 577M 58.0M  627M 6531M|   0    99k   67k   79k
> > > > 21-08 21:48:08| 87   9   3   0   0   0|   0  2840k|   0     0 |   0     0 |2558  4164 |   0  7934M| 597M 58.0M  632M 6507M|   0    96k   57k   51k
> > > > 21-08 21:48:09| 93   7   1   0   0   0|   0  3032k|   0     0 |   0     0 |1329  1512 |   0  7934M| 641M 58.0M  626M 6469M|   0    61k   48k   39k
> > > > 21-08 21:48:10| 93   6   0   0   0   0|   0  4984k|   0     0 |   0     0 |1160  1146 |   0  7934M| 572M 58.0M  628M 6536M|   0    50k   40k   57k
> > > > 21-08 21:48:11| 86   9   6   0   0   0|   0  2520k|   0     0 |   0     0 |2947  4760 |   0  7934M| 605M 58.0M  631M 6500M|   0   103k   55k   45k
> > > > 21-08 21:48:12| 90   8   2   0   0   0|   0  2840k|   0     0 |   0     0 |2674  4179 |   0  7934M| 671M 58.0M  635M 6431M|   0    84k   59k   42k
> > > > 21-08 21:48:13| 90   9   1   0   0   0|   0  4656k|   0     0 |   0     0 |1223  1410 |   0  7934M| 643M 58.0M  638M 6455M|   0    90k   62k   68k
> > > > 21-08 21:48:14| 91   8   1   0   0   0|   0  3572k|   0     0 |   0     0 |1432  1828 |   0  7934M| 647M 58.0M  641M 6447M|   0    81k   59k   57k
> > > > 21-08 21:48:15| 91   8   1   0   0   0|   0  5116k| 116B    0 |   0     0 |1194  1295 |   0  7934M| 605M 58.0M  644M 6487M|   0    69k   54k   64k
> > > > 21-08 21:48:16| 87  10   3   0   0   0|   0  5140k|   0     0 |   0     0 |1761  2586 |   0  7934M| 584M 58.0M  650M 6502M|   0   105k   64k   68k
> > > >
> > > > The abnormal case compiled only 182 object files in 6 and a half
> > > > minutes, then I stopped it. The same dstat output for this case:
> > > > ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
> > > >      time     |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw | used  free| used  buff  cach  free|majpf minpf alloc  free
> > > > 21-08 22:10:49| 27  62   0   0  11   0|   0     0 | 210B    0 |   0     0 |1414  3137k|   0  7934M| 531M 57.6M  595M 6611M|   0   1628  1250   322
> > > > 21-08 22:10:50| 25  60   4   0  11   0|   0    88k| 126B    0 |   0     0 |1337  3110k|   0  7934M| 531M 57.6M  595M 6611M|   0     91   128   115
> > > > 21-08 22:10:51| 26  63   0   0  11   0|   0   184k| 294B    0 |   0     0 |1411  3147k|   0  7934M| 531M 57.6M  595M 6611M|   0   1485   814   815
> > > > 21-08 22:10:52| 26  63   0   0  11   0|   0     0 | 437B  239B|   0     0 |1355  3160k|   0  7934M| 531M 57.6M  595M 6611M|   0     24    94    97
> > > > 21-08 22:10:53| 26  63   0   0  11   0|   0     0 | 168B    0 |   0     0 |1397  3155k|   0  7934M| 531M 57.6M  595M 6611M|   0    479   285   273
> > > > 21-08 22:10:54| 26  63   0   0  11   0|   0  4096B| 396B  324B|   0     0 |1346  3154k|   0  7934M| 531M 57.6M  595M 6611M|   0     27   145   145
> > > > 21-08 22:10:55| 26  63   0   0  11   0|   0    60k|   0     0 |   0     0 |1353  3148k|   0  7934M| 531M 57.6M  595M 6610M|   0     93   117    36
> > > > 21-08 22:10:56| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1341  3172k|   0  7934M| 531M 57.6M  595M 6610M|   0    158    87    74
> > > > 21-08 22:10:57| 26  62   1   0  11   0|   0     0 |  42B   60B|   0     0 |1332  3162k|   0  7934M| 531M 57.6M  595M 6610M|   0     56    82    78
> > > > 21-08 22:10:58| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1334  3178k|   0  7934M| 531M 57.6M  595M 6610M|   0     26    56    56
> > > > 21-08 22:10:59| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1336  3179k|   0  7934M| 531M 57.6M  595M 6610M|   0      3    33    32
> > > > 21-08 22:11:00| 26  63   0   0  11   0|   0    24k|  90B  108B|   0     0 |1347  3172k|   0  7934M| 531M 57.6M  595M 6610M|   0     41    73    71
> > > >
> > > > I have four logical cores so 25% makes up 1 core. I don't know if
> > > > the ~26% user CPU usage has anything to do with this fact or just
> > > > coincidence. The rest is ~63% system and ~11% hardware interrupt.
> > > > Do these support what you suspect?
> > >
> > > The massive increase in context switches does come as a bit of a
> > > surprise!  It does rule out my initial suspicion of lock contention,
> > > but then again the fact that you have only four CPUs made that pretty
> > > unlikely to begin with.
> > >
> > > 2.4k average context switches in the good case for the full run vs.
> > > 3,156k for about half of a run in the bad case.  That is an increase
> > > of more than three orders of magnitude!
> > >
> > > Yow!!!
> > >
> > > Page faults are actually -higher- in the good case.  You have about
> > > 6.5GB free in both cases, so you are not running out of memory.
> > > Lots more disk writes in the good case, perhaps consistent with its
> > > getting more done.  Networking is negligible in both cases.
> > >
> > > Lots of hardware interrupts in the bad case as well.  Would you be
> > > willing to take a look at /proc/interrupts before and after to see
> > > which one you are getting hit with?
> > > (Or whatever interrupt tracking tool you prefer.)
> >
> > Here are the results.
> >
> > Good case before:
> >            CPU0       CPU1       CPU2       CPU3
> >   0:         17          0          0          0   IO-APIC-edge      timer
> >   1:        356          1         68          4   IO-APIC-edge      i8042
> >   8:          0          0          1          0   IO-APIC-edge      rtc0
> >   9:        330         14        449         71   IO-APIC-fasteoi   acpi
> >  12:         10        108        269       2696   IO-APIC-edge      i8042
> >  16:         36         10        111          2   IO-APIC-fasteoi   ehci_hcd:usb1
> >  17:         20          3         25          4   IO-APIC-fasteoi   mmc0
> >  21:          3          0         34          0   IO-APIC-fasteoi   ehci_hcd:usb2
> >  40:          0          1         12         11   PCI-MSI-edge      mei_me
> >  41:      10617        173       9959        292   PCI-MSI-edge      ahci
> >  42:        862         11        186         26   PCI-MSI-edge      xhci_hcd
> >  43:        107         77         27        102   PCI-MSI-edge      i915
> >  44:       5322         20        434         22   PCI-MSI-edge      iwlwifi
> >  45:        180          0        183         86   PCI-MSI-edge      snd_hda_intel
> >  46:          0          3          0          0   PCI-MSI-edge      eth0
> > NMI:          1          0          0          0   Non-maskable interrupts
> > LOC:      16312      15177      10840       8995   Local timer interrupts
> > SPU:          0          0          0          0   Spurious interrupts
> > PMI:          1          0          0          0   Performance monitoring interrupts
> > IWI:       1160        523       1031        481   IRQ work interrupts
> > RTR:          3          0          0          0   APIC ICR read retries
> > RES:      14976      16135       9973      10784   Rescheduling interrupts
> > CAL:        482        457        151        370   Function call interrupts
> > TLB:         70        106        352        230   TLB shootdowns
> > TRM:          0          0          0          0   Thermal event interrupts
> > THR:          0          0          0          0   Threshold APIC interrupts
> > MCE:          0          0          0          0   Machine check exceptions
> > MCP:          2          2          2          2   Machine check polls
> > ERR:          0
> > MIS:          0
> >
> > Good case after:
> >            CPU0       CPU1       CPU2       CPU3
> >   0:         17          0          0          0   IO-APIC-edge      timer
> >   1:        367          1         81          4   IO-APIC-edge      i8042
> >   8:          0          0          1          0   IO-APIC-edge      rtc0
> >   9:        478         14        460         71   IO-APIC-fasteoi   acpi
> >  12:         10        108        269       2696   IO-APIC-edge      i8042
> >  16:         36         10        111          2   IO-APIC-fasteoi   ehci_hcd:usb1
> >  17:         20          3         25          4   IO-APIC-fasteoi   mmc0
> >  21:          3          0         34          0   IO-APIC-fasteoi   ehci_hcd:usb2
> >  40:          0          1         12         11   PCI-MSI-edge      mei_me
> >  41:      16888        173       9959        292   PCI-MSI-edge      ahci
> >  42:       1102         11        186         26   PCI-MSI-edge      xhci_hcd
> >  43:        107        132         27        136   PCI-MSI-edge      i915
> >  44:       6943         20        434         22   PCI-MSI-edge      iwlwifi
> >  45:        180          0        183         86   PCI-MSI-edge      snd_hda_intel
> >  46:          0          3          0          0   PCI-MSI-edge      eth0
> > NMI:          4          3          3          3   Non-maskable interrupts
> > LOC:      26845      24780      19025      17746   Local timer interrupts
> > SPU:          0          0          0          0   Spurious interrupts
> > PMI:          4          3          3          3   Performance monitoring interrupts
> > IWI:       1637        751       1287        695   IRQ work interrupts
> > RTR:          3          0          0          0   APIC ICR read retries
> > RES:      26511      26673      18791      20194   Rescheduling interrupts
> > CAL:        510        480        151        370   Function call interrupts
> > TLB:        361        292        575        461   TLB shootdowns
> > TRM:          0          0          0          0   Thermal event interrupts
> > THR:          0          0          0          0   Threshold APIC interrupts
> > MCE:          0          0          0          0   Machine check exceptions
> > MCP:          2          2          2          2   Machine check polls
> > ERR:          0
> > MIS:          0
> >
> > Bad case before:
> >            CPU0       CPU1       CPU2       CPU3
> >   0:         17          0          0          0   IO-APIC-edge      timer
> >   1:        172          3         78          3   IO-APIC-edge      i8042
> >   8:          0          1          0          0   IO-APIC-edge      rtc0
> >   9:       1200        148        395         81   IO-APIC-fasteoi   acpi
> >  12:       1625          2        348         10   IO-APIC-edge      i8042
> >  16:         26         23        115          0   IO-APIC-fasteoi   ehci_hcd:usb1
> >  17:         16          3         12         21   IO-APIC-fasteoi   mmc0
> >  21:          2          2         33          0   IO-APIC-fasteoi   ehci_hcd:usb2
> >  40:          0          0         14         10   PCI-MSI-edge      mei_me
> >  41:      15776        374       8497        687   PCI-MSI-edge      ahci
> >  42:       1297        829        115         24   PCI-MSI-edge      xhci_hcd
> >  43:        103        149          9        212   PCI-MSI-edge      i915
> >  44:      13151        101        511         91   PCI-MSI-edge      iwlwifi
> >  45:        153        159          0        122   PCI-MSI-edge      snd_hda_intel
> >  46:          0          1          1          0   PCI-MSI-edge      eth0
> > NMI:         32         31         31         31   Non-maskable interrupts
> > LOC:      82504      82732      74172      75985   Local timer interrupts
> > SPU:          0          0          0          0   Spurious interrupts
> > PMI:         32         31         31         31   Performance monitoring interrupts
> > IWI:      17816      16278      13833      13282   IRQ work interrupts
> > RTR:          3          0          0          0   APIC ICR read retries
> > RES:      18784      21084      13313      12946   Rescheduling interrupts
> > CAL:        393        422        306        356   Function call interrupts
> > TLB:        231        176        235        191   TLB shootdowns
> > TRM:          0          0          0          0   Thermal event interrupts
> > THR:          0          0          0          0   Threshold APIC interrupts
> > MCE:          0          0          0          0   Machine check exceptions
> > MCP:          3          3          3          3   Machine check polls
> > ERR:          0
> > MIS:          0
> >
> > Bad case after:
> >            CPU0       CPU1       CPU2       CPU3
> >   0:         17          0          0          0   IO-APIC-edge      timer
> >   1:        415          3         85          3   IO-APIC-edge      i8042
> >   8:          0          1          0          0   IO-APIC-edge      rtc0
> >   9:       1277        148        428         81   IO-APIC-fasteoi   acpi
> >  12:       1625          2        348         10   IO-APIC-edge      i8042
> >  16:         26         23        115          0   IO-APIC-fasteoi   ehci_hcd:usb1
> >  17:         16          3         12         21   IO-APIC-fasteoi   mmc0
> >  21:          2          2         33          0   IO-APIC-fasteoi   ehci_hcd:usb2
> >  40:          0          0         14         10   PCI-MSI-edge      mei_me
> >  41:      17814        374       8497        687   PCI-MSI-edge      ahci
> >  42:       1567        829        115         24   PCI-MSI-edge      xhci_hcd
> >  43:        103        177          9        242   PCI-MSI-edge      i915
> >  44:      14956        101        511         91   PCI-MSI-edge      iwlwifi
> >  45:        153        159          0        122   PCI-MSI-edge      snd_hda_intel
> >  46:          0          1          1          0   PCI-MSI-edge      eth0
> > NMI:         36         35         34         34   Non-maskable interrupts
> > LOC:      92429      92708      81714      84071   Local timer interrupts
> > SPU:          0          0          0          0   Spurious interrupts
> > PMI:         36         35         34         34   Performance monitoring interrupts
> > IWI:      22594      19658      17439      14257   IRQ work interrupts
> > RTR:          3          0          0          0   APIC ICR read retries
> > RES:      21491      24670      14618      14569   Rescheduling interrupts
> > CAL:        441        439        306        356   Function call interrupts
> > TLB:        232        181        274        465   TLB shootdowns
> > TRM:          0          0          0          0   Thermal event interrupts
> > THR:          0          0          0          0   Threshold APIC interrupts
> > MCE:          0          0          0          0   Machine check exceptions
> > MCP:          3          3          3          3   Machine check polls
> > ERR:          0
> > MIS:          0
>
> Lots more local timer interrupts, which is consistent with the higher
> time in interrupt handlers for the bad case.
>
> > > One hypothesis is that your workload and configuration are
> > > interacting with RCU_FAST_NO_HZ to force very large numbers of RCU
> > > grace periods.  Could you please check for this by building with
> > > CONFIG_RCU_TRACE=y, mounting debugfs somewhere and dumping
> > > rcu/rcu_sched/rcugp before and after each run?
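(An aside on the interrupt counts above: comparing two /proc/interrupts snapshots by eye is tedious. A small sketch like the one below can do it instead; the file names are placeholders for the before/after captures, and the awk details are a hypothetical helper, not something from this thread.)

```shell
# Sum each interrupt source across all CPU columns in two /proc/interrupts
# snapshots and print how much each source grew during the run.
# "$1" is the before snapshot, "$2" the after snapshot.
delta_interrupts() {
  awk 'NR==FNR { n=0; for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) n+=$i;
                 before[$1]=n; next }
       { n=0; for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) n+=$i;
         if (($1 in before) && n > before[$1]) print $1, n - before[$1] }' "$1" "$2"
}
# Usage, once the two captures exist (placeholder names):
#   delta_interrupts irq.before irq.after | sort -k2 -rn | head
```

Sorting the output by the second column puts the loudest interrupt source (LOC: in the bad case above) at the top.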
> >
> > Good case before:
> > completed=8756  gpnum=8757  age=0  max=21
> > after:
> > completed=14686  gpnum=14687  age=0  max=21
> >
> > Bad case before:
> > completed=22970  gpnum=22971  age=0  max=21
> > after:
> > completed=26110  gpnum=26111  age=0  max=21
>
> In the good case, (14686-8756)/40=148.25 grace periods per second, which
> is a fast but reasonable rate given your HZ=250.  Not a large difference
> in the number of grace periods, but extrapolating for the longer
> runtime, maybe ten times as much.  But not much change in grace-period
> rate per unit time.
>
> > The test scenario was the following in both cases (mixed English and
> > pseudo-bash):
> > reboot, login, start terminal
> > cd project
> > rm -r build
> > cat /proc/interrupts >> somefile ; cat /sys/kernel/debug/rcu/rcu_sched/rcugp >> somefile
> > scons -j4
> > wait ~40 sec (good case finished, Ctrl-C in bad case)
> > cat /proc/interrupts >> somefile ; cat /sys/kernel/debug/rcu/rcu_sched/rcugp >> somefile
> >
> > I stopped the build in the bad case after about the same time the good
> > case finished, so the extra interrupts and RCU grace periods because
> > of the longer runtime don't fake the results.
>
> That procedure works for me, thank you for laying it out carefully.
>
> I believe I see what is going on and how to fix it, though it may take
> me a bit to work things through and get a good patch.
>
> Thank you very much for your testing efforts!
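(For completeness: the 148.25 figure quoted above falls straight out of the two good-case rcugp snapshots, completed=8756 before and completed=14686 after a ~40-second run. A trivial sketch of the arithmetic:)

```shell
# Grace-period rate in the good case: completed went 8756 -> 14686 over a
# ~40 second build, per the rcugp snapshots quoted above.
awk 'BEGIN { printf "%.2f\n", (14686 - 8756) / 40 }'   # prints 148.25
```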
I'm glad I can help. I've been using Linux for many years, now I have a
chance to help the community, to do something in return. I'm quite
enjoying this :)

Tibor
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/