GitHub user nicochen opened a pull request:

    https://github.com/apache/flink/pull/4472

    FLINK-7368: MetricStore makes cpu spin at 100%

    Flink's `MetricStore` is not thread-safe. Multiple threads may access the 
Java `HashMap` inside `MetricStore` concurrently, which can trigger an infinite 
loop in `HashMap`. 
    
    Recently I hit a case where the Flink JobManager consumed 100% CPU. Part of 
the stack trace is shown below. The full jstack output is in the attachment.
    {code:java}
    "ForkJoinPool-1-worker-19" daemon prio=10 tid=0x00007fbdacac9800 nid=0x64c1 runnable [0x00007fbd7d1c2000]
       java.lang.Thread.State: RUNNABLE
            at java.util.HashMap.put(HashMap.java:494)
            at org.apache.flink.runtime.webmonitor.metrics.MetricStore.addMetric(MetricStore.java:176)
            at org.apache.flink.runtime.webmonitor.metrics.MetricStore.add(MetricStore.java:121)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.addMetrics(MetricFetcher.java:198)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher.access$500(MetricFetcher.java:58)
            at org.apache.flink.runtime.webmonitor.metrics.MetricFetcher$4.onSuccess(MetricFetcher.java:188)
            at akka.dispatch.OnSuccess.internal(Future.scala:212)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
            at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
            at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
            at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
            at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
            at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265)
            at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334)
            at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604)
            at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784)
            at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646)
            at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398)
    {code}
    
    There are 24 threads showing the same stack trace as above, which indicates 
they are spinning at HashMap.put(HashMap.java:494) (I am using Java 1.7.0_6). 
Many posts indicate that concurrent access to a `HashMap` from multiple threads 
causes this problem, and I reproduced the case as well. Even though 
`MetricFetcher` enforces a minimum interval of 10 seconds between metrics 
queries, it still cannot guarantee that query responses do not access 
`MetricStore`'s `HashMap` concurrently. 
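
    For illustration only, here is a minimal sketch of one way to guard a plain 
`HashMap` against concurrent writers: every access goes through `synchronized` 
methods. The class `SynchronizedStoreSketch` and its methods are hypothetical 
names made up for this example. They are not Flink's `MetricStore` API and 
not necessarily the change made in this pull request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metric store; not Flink's MetricStore.
class SynchronizedStoreSketch {
    // Plain HashMap: only safe because every access below is synchronized.
    private final Map<String, String> metrics = new HashMap<>();

    public synchronized void addMetric(String name, String value) {
        metrics.put(name, value);
    }

    public synchronized String getMetric(String name) {
        return metrics.get(name);
    }

    public synchronized int size() {
        return metrics.size();
    }

    // Starts several writer threads that all insert into the same store,
    // roughly mimicking query responses arriving on different pool threads.
    static SynchronizedStoreSketch fillConcurrently(int threads, int perThread) {
        final SynchronizedStoreSketch store = new SynchronizedStoreSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    store.addMetric("worker-" + id + ".metric-" + i, String.valueOf(i));
                }
            });
            workers[t].start();
        }
        // Wait for all writers to finish before returning the store.
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return store;
    }

    public static void main(String[] args) {
        SynchronizedStoreSketch store = fillConcurrently(8, 1000);
        System.out.println("entries: " + store.size());
    }
}
```

    Because all keys are distinct and every `put` holds the same lock, no 
entries are lost and no thread can observe the map in the middle of a resize. 
Replacing the field with a `java.util.concurrent.ConcurrentHashMap` would be 
another option when only single-key operations need to be safe.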


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nicochen/flink FLINK-7368

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4472
    
----
commit abfa571fbf99be4b98d8d690ed10df1440dd21d5
Author: nicochen2012 <16100...@cnsuning.com>
Date:   2017-08-04T03:21:49Z

    FLINK-7368: MetricStore makes cpu spin at 100%

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
