keith-turner commented on issue #5712:
URL: https://github.com/apache/accumulo/issues/5712#issuecomment-3050635696

   > I also ran the bulk random walk test with Accumulo version 2.1.4rc0 and 
Zookeeper version 3.9.2, and saw zookeeper's CPU usage fairly high across the 
whole of the test.
   
   That was helpful; it made me realize this is a more general problem, not 
one that only shows up after things get stuck.
   
   Looked into this some more and enabled zookeeper server trace logging while 
running the test.  Looking at the ZK trace logs, saw that getChildren was the 
most frequent operation.
   
   ```
   $ grep reqpath zookeeper_trace.log | grep -o -E "type:[a-zA-Z]+" | sort | uniq -c
     10605 type:create
      8429 type:delete
     59636 type:exists
   11450430 type:getChildren
   1715452 type:getData
       390 type:ping
      3736 type:setData
        36 type:setWatches
      3512 type:sync
   13235432 type:unknown
   ```
   
   Based on that, I found that the following was by far the most frequently 
requested path for getChildren calls.
   
   ```
   $ grep getChildren zookeeper_trace.log | grep -o -E "reqpath:[^ ]+" | sort | uniq -c | sort -r -n -k 1 | head
   11201694 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate
     134968 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/table_locks/2
      13664 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/table_locks/+default
       3748 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_6911b96289b3d855
       3688 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_0b06908ef4e33636
       3616 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_6b629f2b27214a8e
       3188 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_46d92e738ee34ef3
       2872 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_5e3af9674c1c87a8
       2392 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_20b3516eb7a7c1d7
       1920 reqpath:/accumulo/f4e97c86-c07b-484f-9214-3a7fe8687c98/fate/tx_776533e52e653bca
   ```
   
   That led to opening #5728.  I had run into problems in the manager with that 
code a year or two ago, but did not realize it was placing such load on 
ZooKeeper in addition to the manager.  Makes sense though.  I am seeing much 
lower CPU usage on the ZooKeeper server when running the bulk randomwalk test 
with #5728.  I did not enable trace logging again to check the getChildren 
counts, though.
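
   For a later re-check, the two grep pipelines above could be folded into a 
single pass that counts requests per (operation, path) pair. This is only a 
sketch: the function name `zk_op_path_counts` is made up for illustration, and 
it assumes every request line in the trace log carries whitespace-separated 
`type:<op>` and `reqpath:<path>` fields, which is what the grep patterns above 
imply. It matches whole fields, so a `txntype:` field, if the trace lines have 
one, is not counted as an operation; that may be where the `type:unknown` row 
in the first count comes from, though I have not verified it.

```shell
# Count ZooKeeper trace-log requests per (operation type, request path).
# Hypothetical helper; assumes request lines contain whitespace-separated
# "type:<op>" and "reqpath:<path>" fields.
zk_op_path_counts() {
  awk '
    {
      op = ""; path = ""
      for (i = 1; i <= NF; i++) {
        # Match whole fields so "txntype:..." is not mistaken for "type:...".
        if ($i ~ /^type:/)    op = substr($i, 6)
        if ($i ~ /^reqpath:/) path = substr($i, 9)
      }
      # Skip lines that lack either field.
      if (op != "" && path != "") counts[op " " path]++
    }
    END {
      for (k in counts) print counts[k], k
    }
  ' "$1" | sort -k1,1nr -k2
}
```

   Running `zk_op_path_counts zookeeper_trace.log | head` would then list the 
busiest (operation, path) pairs first, which makes it easy to compare a run 
with and without #5728.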
   
   Also found that ZooKeeper has audit logging and tried turning that on before 
the trace logging. It's useful, but it only shows writes, and in this case 
excessive reads were causing the problem.

