[ https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162898#comment-17162898 ]

Gabor Arki edited comment on KYLIN-4500 at 7/22/20, 4:21 PM:
-------------------------------------------------------------

[~hit_lacus], the linked issue seems to be unrelated. If the dictionary were 
stored on the cluster it could be related, but I do not see any 
FileNotFoundException in the logs when we hit this issue. I do, however, see a 
slow ramp-up of CLOSE_WAIT connections on the server.
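
For reference, the CLOSE_WAIT ramp-up can be watched with 
`netstat -tan | grep CLOSE_WAIT | wc -l`, or programmatically by parsing 
/proc/net/tcp (Linux-only and IPv4-only; kernel state code 08 is 
TCP_CLOSE_WAIT). A minimal, illustrative sketch:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Counts sockets currently in CLOSE_WAIT by reading /proc/net/tcp.
// Linux-only and IPv4-only; state column value "08" is TCP_CLOSE_WAIT.
public class CloseWaitCounter {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("/proc/net/tcp"))) {
            long closeWait = lines
                    .skip(1)                                        // header row
                    .map(line -> line.trim().split("\\s+"))
                    .filter(f -> f.length > 3 && "08".equals(f[3])) // 4th column: socket state
                    .count();
            System.out.println("CLOSE_WAIT sockets: " + closeWait);
        }
    }
}
{code}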

We are running Kylin on an AWS EMR cluster and use S3 (EMRFS) instead of HDFS 
for data storage, to keep the cluster stateless. After some continuous uptime, 
however, we always hit this issue: both the query server and the Kylin MR jobs 
suddenly start failing with the aforementioned exception. The root cause of 
these failures is that the EMR cluster's connection pool to S3 is exhausted, 
so new operations cannot acquire a connection and time out while waiting for 
an S3 connection.
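
To illustrate the failure mode: with an S3-backed Hadoop FileSystem, an open 
input stream can hold an HTTP connection checked out of the pool until close() 
returns it, so a code path that forgets to close its streams eventually 
starves every other caller. A minimal sketch of the leak pattern (the bucket, 
object, and loop are hypothetical; this is not Kylin code):

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConnectionLeakSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3://my-bucket/"), new Configuration());
        for (int i = 0; i < 10_000; i++) {
            // BUG: the stream is read but never closed, so the underlying HTTP
            // connection is never returned to the pool. Once fs.s3.maxConnections
            // streams have been leaked, every further S3 operation blocks and
            // fails with "Timeout waiting for connection from pool".
            FSDataInputStream in = fs.open(new Path("s3://my-bucket/some-object"));
            in.read();
        }
    }
}
{code}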

No matter how large a pool size we configure via fs.s3.maxConnections, this 
keeps happening. The underlying issue is very likely a connection leak: some 
code path is not closing its connection and returning it to the pool. Given 
that restarting the query server resolves the issue, I suspect the pool is 
exhausted somewhere in the Kylin query server code.
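
If that suspicion is correct, the fix on the Kylin side would be the standard 
try-with-resources pattern, which guarantees the connection is handed back to 
the pool even on exception paths. Continuing the hypothetical sketch above:

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NoLeakSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3://my-bucket/"), new Configuration());
        for (int i = 0; i < 10_000; i++) {
            // Leak-free variant: try-with-resources always calls in.close(),
            // which releases the HTTP connection back to the pool, even if
            // read() throws.
            try (FSDataInputStream in = fs.open(new Path("s3://my-bucket/some-object"))) {
                in.read();
            }
        }
    }
}
{code}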



> Timeout waiting for connection from pool
> ----------------------------------------
>
>                 Key: KYLIN-4500
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4500
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: Gabor Arki
>            Priority: Major
>         Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After extended uptime, both the Kylin query server and the jobs running on 
> EMR stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
>         at com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257) ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/],
> increasing the fs.s3.maxConnections setting to 10000 merely delays the 
> issue, so the underlying problem is likely a connection leak. The fact that 
> restarting the Kylin service fixes the problem also points to a leak.
> A full stack trace from the QueryService is attached.
>  


