[ https://issues.apache.org/jira/browse/KYLIN-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335353#comment-15335353 ]
liyang commented on KYLIN-1523:
-------------------------------

In the dump, many threads stay (hang?) at this point: a HashMap being accessed concurrently. I believe a later version of Calcite has fixed this issue, because as of Kylin 1.5 + Calcite 1.6.0, the HashMap has been replaced by a ConcurrentMap. Upgrading to Kylin 1.5 should solve this problem.

{code}
"http-bio-7070-exec-273" daemon prio=10 tid=0x00007f86799c1800 nid=0x4f7d runnable [0x00007f76f63c8000]
   java.lang.Thread.State: RUNNABLE
	at java.util.HashMap.getEntry(HashMap.java:465)
	at java.util.HashMap.get(HashMap.java:417)
	at org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider.apply(ReflectiveRelMetadataProvider.java:251)
	at org.apache.calcite.rel.metadata.ChainedRelMetadataProvider.apply(ChainedRelMetadataProvider.java:60)
	at org.apache.calcite.rel.metadata.MetadataFactoryImpl$1.load(MetadataFactoryImpl.java:56)
	at org.apache.calcite.rel.metadata.MetadataFactoryImpl$1.load(MetadataFactoryImpl.java:53)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3579)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2372)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2335)
	- locked <0x00007f77f9d94458> (a com.google.common.cache.LocalCache$StrongEntry)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2250)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3980)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3984)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4868)
	at org.apache.calcite.rel.metadata.MetadataFactoryImpl.query(MetadataFactoryImpl.java:69)
	at org.apache.calcite.rel.AbstractRelNode.metadata(AbstractRelNode.java:271)
	at org.apache.calcite.rel.metadata.RelMetadataQuery.getRowCount(RelMetadataQuery.java:84)
	at org.apache.calcite.adapter.enumerable.EnumerableJoin.computeSelfCost(EnumerableJoin.java:79)
	at org.apache.kylin.query.relnode.OLAPJoinRel.computeSelfCost(OLAPJoinRel.java:99)
{code}

> Kylin server hang when many bad query running
> ---------------------------------------------
>
>                 Key: KYLIN-1523
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1523
>             Project: Kylin
>          Issue Type: Bug
>          Components: Query Engine
>    Affects Versions: v1.4.0
>            Reporter: qianqiaoneng
>            Assignee: liyang
>            Priority: Critical
>         Attachments: kylin-425.jstack
>
>
> When some bad queries are running, the Kylin server gets exhausted and then
> crashes. It would be better to have some control over the resources a bad
> query consumes, or to kill the bad query after a timeout; bad queries should
> not be allowed to bring the server down.
> The root cause is that too many slow queries like the ones below exhaust Kylin:
> 2016-03-21 14:28:33,487 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3557 seconds (thread id 0x1841) –
> 2016-03-21 14:28:33,489 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3557 seconds (thread id 0x1840) –
> 2016-03-21 14:28:33,490 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3556 seconds (thread id 0x1842) –
> 2016-03-21 14:28:33,491 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3556 seconds (thread id 0x1843) –
> 2016-03-21 14:28:33,493 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3556 seconds (thread id 0x1844) –
> 2016-03-21 14:28:33,494 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3553 seconds (thread id 0x1845) –
> 2016-03-21 14:28:33,495 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3553 seconds (thread id 0x1847) –
> 2016-03-21 14:28:33,509 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3552 seconds (thread id 0x1848) –
> 2016-03-21 14:28:33,512 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3551 seconds (thread id 0x184a) –
> 2016-03-21 14:28:33,525 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3510 seconds (thread id 0x184c) –
> 2016-03-21 14:28:33,554 INFO [BadQueryDetector] service.BadQueryDetector:57 : Slow query has been running 3510 seconds (thread id 0x184d) –

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
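To illustrate the fix liyang describes: a plain HashMap read during a concurrent resize can leave a reader spinning forever in getEntry (exactly the RUNNABLE frames in the dump), while a ConcurrentMap stays safe under concurrent access. The sketch below is hypothetical, not Calcite's actual ReflectiveRelMetadataProvider code; class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a metadata-handler cache, in the spirit of the
// lookup that hangs in the stack trace. Names are illustrative only.
class HandlerCache {
    // ConcurrentHashMap tolerates concurrent readers and writers; a plain
    // HashMap accessed during a concurrent resize can loop forever.
    private final Map<String, String> handlers = new ConcurrentHashMap<>();

    // computeIfAbsent is atomic per key, so many query threads can ask for
    // the same RelNode class at once without corrupting the table.
    String lookup(String relClass) {
        return handlers.computeIfAbsent(relClass, k -> "handler:" + k);
    }
}

public class Main {
    public static void main(String[] args) {
        HandlerCache cache = new HandlerCache();
        System.out.println(cache.lookup("EnumerableJoin")); // handler:EnumerableJoin
    }
}
```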
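The timeout control the reporter asks for ("kill the bad query with some timeout") could look roughly like the watchdog below. This is a minimal sketch under assumed names; it is not Kylin's actual BadQueryDetector, which only logs slow queries in the version shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical watchdog: query threads register on start, and a periodic
// sweep interrupts any thread that has run past the timeout.
class QueryWatchdog {
    private final Map<Thread, Long> running = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    QueryWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void register(Thread t) {
        running.put(t, System.currentTimeMillis());
    }

    void unregister(Thread t) {
        running.remove(t);
    }

    // Called periodically (e.g. from a scheduled thread): interrupts and
    // drops every registered thread that has exceeded the timeout.
    int sweep() {
        long now = System.currentTimeMillis();
        int interrupted = 0;
        for (Map.Entry<Thread, Long> e : running.entrySet()) {
            if (now - e.getValue() > timeoutMillis) {
                e.getKey().interrupt();
                running.remove(e.getKey());
                interrupted++;
            }
        }
        return interrupted;
    }
}

public class Main {
    public static void main(String[] args) {
        QueryWatchdog watchdog = new QueryWatchdog(60_000);
        watchdog.register(Thread.currentThread());
        // Nothing has timed out yet, so the sweep interrupts no one.
        System.out.println(watchdog.sweep()); // 0
        watchdog.unregister(Thread.currentThread());
    }
}
```

Interruption only helps if the query thread checks its interrupt flag; a stuck HashMap spin like the one in the dump would not notice it, which is why the upgrade fix is the real remedy here.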