[ https://issues.apache.org/jira/browse/TS-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leif Hedstrom updated TS-2089: ------------------------------ Fix Version/s: (was: 4.1.0) 4.2.0 Moving these to v4.2.0, please move any bugs back that will be worked on before the November release of v4.1.0. > one log collation thread is not enough for busy system > ------------------------------------------------------ > > Key: TS-2089 > URL: https://issues.apache.org/jira/browse/TS-2089 > Project: Traffic Server > Issue Type: Improvement > Components: Logging > Reporter: Zhao Yongming > Assignee: Yunkai Zhang > Fix For: 4.2.0 > > Attachments: > 0001-TS-2089-introduce-configurable-collation-preproc-thr.patch, > 0001-TS-2089-introduce-configurable-collation-preproc-thr.patch.V2 > > > we should make log collation more thread & cpu nice. > > {code} > top - 22:34:22 up 3:37, 1 user, load average: 1.20, 1.71, 1.66 > Tasks: 979 total, 2 running, 977 sleeping, 0 stopped, 0 zombie > Cpu(s): 3.5%us, 0.6%sy, 0.0%ni, 95.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > Mem: 65943744k total, 62536348k used, 3407396k free, 90544k buffers > Swap: 2097144k total, 0k used, 2097144k free, 54189788k cached > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 4651 ats 20 0 12.9g 5.3g 4928 R 100.1 8.5 197:51.78 [LOGGING] > 4460 ats 20 0 12.9g 5.3g 4928 S 7.9 8.5 18:12.57 [ET_NET 44] > 4457 ats 20 0 12.9g 5.3g 4928 S 5.3 8.5 7:32.34 [ET_NET 41] > 4461 ats 20 0 12.9g 5.3g 4928 S 4.3 8.5 6:41.79 [ET_NET 45] > 4453 ats 20 0 12.9g 5.3g 4928 S 3.0 8.5 6:44.52 [ET_NET 37] > 4463 ats 20 0 12.9g 5.3g 4928 S 3.0 8.5 7:54.01 [ET_NET 47] > 3652 root 39 19 0 0 0 S 1.0 0.0 2:14.15 kipmi0 > 4425 ats 20 0 12.9g 5.3g 4928 S 0.7 8.5 0:19.21 [ET_NET 9] > 4426 ats 20 0 12.9g 5.3g 4928 S 0.7 8.5 0:19.42 [ET_NET 10] > 4430 ats 20 0 12.9g 5.3g 4928 S 0.7 8.5 0:19.55 [ET_NET 14] > 132 root 20 0 0 0 0 S 0.3 0.0 0:00.82 events/1 > 4417 ats 20 0 12.9g 5.3g 4928 S 0.3 8.5 0:19.82 [ET_NET 1] > 4418 ats 20 0 12.9g 5.3g 4928 S 0.3 8.5 0:20.02 [ET_NET 2] > 4420 ats 20 0 12.9g 5.3g 4928 S 0.3 8.5 0:19.64 [ET_NET 4] > 4422 ats 20 0 12.9g 5.3g 4928 S 0.3 8.5 0:19.84 [ET_NET 6] > 4427 ats 20 0 12.9g 5.3g 4928 S 0.3 8.5 0:19.94 [ET_NET 11] > 4435 ats 20 0 12.9g 5.3g 4928 S 0.3 8.5 0:19.88 [ET_NET 19] > {code} > == Update === > I have developed a patch to fix this issue. After I rebase it to master > branch, I'll attach it to here. Here is the description of my patch: > --------------- > We found that CPU of logging thread could be easy to reach up 100% in > collation host, but disk IO was low at the same time. > The bottleneck of logging thread is that some preprocessing job, such as > convert LogBuffer to ascii text, consume so much CPU time. And more > worse, the write() operation will block logging thread. > So this patch try to split logging thread into two parts: > 1) Configurable preproc threads, which are responsiable for processing all > of prepare work, and then forward the preprocessed buffer to flush thread, > or send them to CollationClient/HostSM. > 2) One Flush thread, it will consume preprocessed buffers and write them to > disk. In our testing, one flush thread is enough for us. > TODO: This patch supports only one flush thread, we can improve it to > "one flush thread per file/disk" in the future. > == How to configure == > The number of preproc threads is 1 by default. > Please modify "proxy.config.log.collation_preproc_threads" option to > change it. > --------------- > ====== > After apply this patch, let's have a look the 'top' output: > {code} > [root@test81.cn8 core]# top -H -p 3984 -b -n 1 > top - 15:39:06 up 17:23, 6 users, load average: 3.37, 3.09, 3.15 > Tasks: 246 total, 6 running, 240 sleeping, 0 stopped, 0 zombie > Cpu(s): 1.4%us, 0.5%sy, 0.0%ni, 97.6%id, 0.4%wa, 0.0%hi, 0.0%si, 0.0%st > Mem: 65943744k total, 65763084k used, 180660k free, 136688k buffers > Swap: 2097144k total, 0k used, 2097144k free, 57275428k cached > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 4237 ats 20 0 8941m 5.2g 5148 R 39.5 8.3 80:46.23 [LOG_PREPROC 2] > 4236 ats 20 0 8941m 5.2g 5148 R 37.5 8.3 80:46.53 [LOG_PREPROC 1] > 4235 ats 20 0 8941m 5.2g 5148 R 35.6 8.3 80:46.71 [LOG_PREPROC 0] > 4238 ats 20 0 8941m 5.2g 5148 R 35.6 8.3 80:48.37 [LOG_PREPROC 3] > 4239 ats 20 0 8941m 5.2g 5148 R 35.6 8.3 80:46.58 [LOG_PREPROC 4] > 4240 ats 20 0 8941m 5.2g 5148 R 11.9 8.3 20:45.28 [LOG_FLUSH] > 4037 ats 20 0 8941m 5.2g 5148 S 7.9 8.3 15:06.60 [ET_NET 38] > 4045 ats 20 0 8941m 5.2g 5148 S 5.9 8.3 9:55.74 [ET_NET 46] > 4038 ats 20 0 8941m 5.2g 5148 S 4.0 8.3 6:08.97 [ET_NET 39] > 4042 ats 20 0 8941m 5.2g 5148 S 4.0 8.3 8:36.73 [ET_NET 43] > ... > {code} > [LOG_PREPROC 0-4] are collation preproc threads, [LOG_FLUSH] is new flush > thread. They have good throughput now. -- This message was sent by Atlassian JIRA (v6.1#6144)