[ 
https://issues.apache.org/jira/browse/KYLIN-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shao Feng Shi resolved KYLIN-4432.
----------------------------------
    Resolution: Fixed

> duplicated queries with sytax error take unexpect long time when lazy query 
> enabled
> -----------------------------------------------------------------------------------
>
>                 Key: KYLIN-4432
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4432
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: Congling Xia
>            Assignee: Congling Xia
>            Priority: Major
>             Fix For: v3.1.0
>
>
> Hi! Our Kylin server becomes unresponsive recently. All query threads were 
> being TIMED_WAITING so that no more thread in the thread pool of Tomcat could 
> response new requests. The server logs are full of "Duplicate SQL request"s, 
> and many requests retrying for so long time:
> {code:java}
> $ cat kylin.log.25 | grep 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991 | head 
> -3
> 2020-03-23 12:15:32,279 INFO [Query 
> 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
> Duplicated SQL request is running, waiting...
> 2020-03-23 12:15:32,379 INFO [Query 
> 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
> Duplicated SQL request is running, waiting...
> 2020-03-23 12:15:32,479 INFO [Query 
> 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
> Duplicated SQL request is running, waiting...
> $ cat kylin.log.25 | grep 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991 | tail 
> -3
> 2020-03-23 12:26:56,920 INFO [Query 
> 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
> Duplicated SQL request is running, waiting...
> 2020-03-23 12:26:57,020 INFO [Query 
> 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
> Duplicated SQL request is running, waiting...
> 2020-03-23 12:26:57,121 INFO [Query 
> 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
> Duplicated SQL request is running, waiting...
> {code}
> This can be re-produced easily:
>  * enable lazy query (KYLIN-2897) by setting environment variable 
> kylin.query.cache-enabled=true and kylin.query.lazy-query-enabled=true
>  * send the same query with syntax error many times in several minutes
> The first query will quickly response with error message, and the following 
> queries will executing for long time and will repeatly output "Duplicated SQL 
> request is running, waiting...".
> The code in org.apache.kylin.rest.service.QueryService#searchQueryInCache 
> indicates duplicated query will wait at most one minute with default setting 
> of LazyQueryWaitingTimeoutMilliSeconds but in fact it waits quite longer than 
> that.
> {code:java}
> while (response.isRunning()) {
>     // Wait at most one minute
>     if (System.currentTimeMillis() - response.getLazyQueryStartTime() >= 
> getConfig()
>             .getLazyQueryWaitingTimeoutMilliSeconds()) {
>         cache.evict(sqlRequest.getCacheKey());
>         return null;
>     }
>     logger.info("Duplicated SQL request is running, waiting...");
>     try {
>         Thread.sleep(100L);
>     } catch (InterruptedException e) {
>     }
>     wrapper = cache.get(sqlRequest.getCacheKey());
>     if (wrapper == null) {
>         return null;
>     }
>     response = (SQLResponse) wrapper.get();
>     if (response == null) {
>         return null;
>     }
> }
> {code}
> After some work-around, I find that the dummy response status remains 
> unchanged after execution with exception. So the 2nd query will wait 
> LazyQueryWaitingTimeoutMilliSeconds, evict the dummy response created by the 
> 1st query from cache, and put a new dummy response into the cache. And the 
> 3rd query may use the new dummy response's start time to compute timeout, and 
> will wait for another LazyQueryWaitingTimeoutMilliSeconds.
> Suppose we send _n_ identical bad queries to the server quickly. The last 
> query may wait for nearly _(n-1)_ times of 
> LazyQueryWaitingTimeoutMilliSeconds. That's why the Web container's thread 
> pool was exhausted – user program uses bad queries to check availability of 
> Kylin service and retries many times in a short time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to