[ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
------------------------------
    Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is already present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing the stream segments with streaming jobs, eventually working through the older segments and catching up with the live stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress; some are in running state, some in pending state.
 * The following log lines are written continuously (a sketch of the retry pattern they imply follows the excerpt):

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
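The log above suggests that each stuck step keeps polling ZooKeeper for its cube's parent lock and ephemeral lock, sleeping one minute between attempts while it continues to occupy a running slot in the scheduler. Below is a minimal, hypothetical sketch of that pattern; the class and method names are placeholders, not Kylin's actual API.

{code:java}
// Hypothetical sketch only -- names are placeholders, not Kylin's API.
public class LockRetrySketch {

    private static final long RETRY_INTERVAL_MS = 60_000L; // "will try after one minute"

    /**
     * The step stays in RUNNING state (occupying one scheduler slot) while it
     * polls the two ZooKeeper lock paths seen in the log above.
     */
    void waitForCubeLocks(String cubeName) throws InterruptedException {
        String parentLockPath    = "/cube_job_lock/" + cubeName;
        String ephemeralLockPath = "/cube_job_ephemeral_lock/" + cubeName;

        while (isLockedByOtherJob(parentLockPath) || isLockedByOtherJob(ephemeralLockPath)) {
            // Matches the INFO line: "... is locked by other job result is true, will try after one minute"
            Thread.sleep(RETRY_INTERVAL_MS);
        }
        // Only once both locks are free can the step do real work.
    }

    // Placeholder for the actual ZooKeeper check; in the reported state this
    // never becomes false because the lock holders are never scheduled.
    private boolean isLockedByOtherJob(String zkPath) {
        return true; // stub
    }
}
{code}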
 * Zookeeper indicates the following locks are in place (a sketch for reading the same lock state programmatically follows the listing):

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
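For completeness, the same lock state can also be read programmatically. The snippet below is a small sketch using Apache Curator (assumed to be on the classpath); the connection string is a placeholder, and the base path is taken from the zkCli session above.

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.List;

public class ListCubeJobLocks {
    public static void main(String[] args) throws Exception {
        // "localhost:2181" is a placeholder connection string.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            String base = "/kylin/kylin_metadata/cube_job_lock";
            for (String cube : client.getChildren().forPath(base)) {
                // Each child of a cube node is the ID of the job holding that cube's lock.
                List<String> holders = client.getChildren().forPath(base + "/" + cube);
                System.out.println(cube + " -> " + holders);
            }
        } finally {
            client.close();
        }
    }
}
{code}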
 * The job IDs of the 10 jobs currently in running state:
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually in running state do not hold the locks and therefore cannot make any progress. Conversely, the 3 jobs that do hold the locks cannot be moved into running state because the scheduler already has 10 running jobs, so they can never proceed and release the locks. This is a deadlock that completely stalls the cluster, as illustrated by the sketch below.
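
To make the deadlock concrete, the snippet below (an illustration only, not Kylin code) cross-checks the two sets of IDs quoted above: the jobs occupying the running slots and the jobs holding the ZooKeeper locks are disjoint, so neither side can ever make progress. The limit of 10 concurrent jobs is an assumption inferred from the "too many jobs running" warning.

{code:java}
import java.util.Set;

// Illustration only: cross-checks the job IDs quoted in this report.
public class DeadlockIllustration {
    public static void main(String[] args) {
        // The jobs in RUNNING state (list abbreviated; full list above).
        Set<String> runningJobs = Set.of(
                "169f75fa-a02f-221b-fc48-037bc7a842d0",
                "0b5dae1b-6faf-66c5-71dc-86f5b820f1c4",
                "062150ba-bacd-6644-4801-3a51b260d1c5");

        // The jobs holding the per-cube ZooKeeper locks.
        Set<String> lockHolders = Set.of(
                "f888380e-9ff4-98f5-2df4-1ae71e045f93",  // cube_cm
                "fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74",  // cube_vm
                "d1a6475a-9ab2-5ee4-6714-f395e20cfc01"); // cube_jm

        boolean anyRunningJobHoldsALock =
                runningJobs.stream().anyMatch(lockHolders::contains);
        boolean allRunningSlotsOccupied = true; // 10 of the (assumed) 10 slots are in use

        // Deadlock: every running slot is held by a job waiting for a lock it
        // cannot get, and every lock is held by a job that cannot get a slot.
        System.out.println("deadlocked = " + (allRunningSlotsOccupied && !anyRunningJobHoldsALock));
    }
}
{code}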

We have been observing this behavior in 3.0.0 (where rolling back https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now in 3.1.0 as well. It was originally reported in the comments of https://issues.apache.org/jira/browse/KYLIN-4348.



> Deadlock in Kylin job execution
> -------------------------------
>
>                 Key: KYLIN-4689
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4689
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>            Reporter: Gabor Arki
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
