keith-turner commented on issue #5712: URL: https://github.com/apache/accumulo/issues/5712#issuecomment-3050635696
> I also ran the bulk random walk test with Accumulo version 2.1.4rc0 and Zookeeper version 3.9.2, and saw zookeeper's CPU usage fairly high across the whole of the test. That was helpful, made me realize it was a more general problem than just after things got stuck. Looked into this some more and enabled zookeeper server trace logging while running the test. Looking at the ZK trace logs, saw that getChildren was the most frequent operation. ``` $ grep reqpath zookeeper_trace.log | grep -o -E "type:[a-zA-Z]+" | sort | uniq -c 10605 type:create 8429 type:delete 59636 type:exists 11450430 type:getChildren 1715452 type:getData 390 type:ping 3736 type:setData 36 type:setWatches 3512 type:sync 13235432 type:unknown ``` Then based on that found the following was the most frequently accessed path for the call to getChildren. ``` $ grep getChildren zookeeper_trace.log | grep -o -E "reqpath:[^ ]+" | sort | uniq -c | sort -r -n -k 1 | head 11201694 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate 134968 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/table_locks/2 13664 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/table_locks/+default 3748 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_6911b96289b3d855 3688 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_0b06908ef4e33636 3616 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_6b629f2b27214a8e 3188 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_46d92e738ee34ef3 2872 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_5e3af9674c1c87a8 2392 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_20b3516eb7a7c1d7 1920 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_776533e52e653bca ``` That led to opening #5728. I had run into problems in the manager w/ that code a year or two ago, did not realize it was placing such load on zookeeper in addition to the manager. Makes sense though. Seeing much lower CPU usage on the zookeeper server when running bulk randomwalk test w/ #5728. Did not enable trace logging again and look at the counts for get children though. Also found that zookeeper has audit logging, tried turning that on before the trace logging. Its useful, but only shows writes and in this case it was excessive reads that were causing the problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@accumulo.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org