[ 
https://issues.apache.org/jira/browse/CASSANDRA-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangzhou xia updated CASSANDRA-13756:
--------------------------------------
    Description: 
While testing C* 3 in a shadow cluster, we noticed that after some time several 
data nodes suddenly hit 100% CPU and stopped processing queries.

After investigation, we found that threads were stuck in sum() in the 
StreamingHistogram class. These are JMX threads that expose the 
getTombStoneRatio metric (since JMX is polled every 3 seconds, multiple JMX 
threads may access the StreamingHistogram at the same time). 
 

After further investigation, we found that the optimization in CASSANDRA-13038 
causes a spool flush on every call to sum(), so a call that looks read-only 
actually writes to the underlying TreeMap. Since TreeMap is not thread safe, 
threads get stuck when several of them call sum() at the same time.

There are two approaches to solving this issue. 

The first is to add a lock around the flush in sum(), which introduces some 
extra overhead to StreamingHistogram.

The second is to prevent StreamingHistogram from being accessed by multiple 
threads. In our specific case, that means removing the metrics we added.  
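
A minimal sketch of the first approach, assuming a simplified histogram that 
keeps pending updates in a spool map and flushed counts in a TreeMap. The class 
SpooledHistogramSketch and its fields are hypothetical and are not the actual 
StreamingHistogram code; the real sum() also interpolates between bins, which 
is omitted here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Simplified stand-in for a spooled histogram; NOT the Cassandra class. */
class SpooledHistogramSketch {
    // Pending updates that have not yet been folded into the bins.
    private final Map<Double, Long> spool = new HashMap<>();
    // Flushed bins; TreeMap is not thread safe, so all access is guarded.
    private final TreeMap<Double, Long> bins = new TreeMap<>();

    /** Records count occurrences of point in the spool. */
    public synchronized void update(double point, long count) {
        spool.merge(point, count, Long::sum);
    }

    /** Moves spooled updates into the bins. Caller must hold the monitor. */
    private void flushSpool() {
        for (Map.Entry<Double, Long> e : spool.entrySet()) {
            bins.merge(e.getKey(), e.getValue(), Long::sum);
        }
        spool.clear();
    }

    /**
     * Returns the total count of points less than or equal to b.
     * The flush mutates the TreeMap, so this "read" takes the same lock
     * as update().
     */
    public synchronized long sum(double b) {
        flushSpool();
        long total = 0;
        for (long c : bins.headMap(b, true).values()) {
            total += c;
        }
        return total;
    }
}
```

With this shape, concurrent callers of sum() are serialized on the object's 
monitor, which is the extra overhead mentioned above.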

  was:
The optimization in CASSANDRA-13038 causes a spool flush on every call to 
sum(). Since TreeMap is not thread safe, threads get stuck when several of 
them call sum() at the same time, and eventually 100% of the CPU is stuck in 
that function. 

I think this issue is not limited to sum(); update() and merge() have the 
same issue, since they all need to update the TreeMap. 

Adding a lock to bin solved this issue, but it also introduced extra overhead.


> StreamingHistogram is not thread safe
> -------------------------------------
>
>                 Key: CASSANDRA-13756
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13756
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: xiangzhou xia


