zhong.zhu created KYLIN-5745:
--------------------------------

             Summary: The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally
                 Key: KYLIN-5745
                 URL: https://issues.apache.org/jira/browse/KYLIN-5745
             Project: Kylin
          Issue Type: Bug
    Affects Versions: 5.0-beta
            Reporter: zhong.zhu
            Assignee: zhong.zhu
             Fix For: 5.0.0

{*}Problem description{*}: The scheduled garbage cleanup task cannot complete successfully.

{*}Background{*}: The customer found a large number of small Kylin files occupying HDFS storage that needed to be cleaned up. When we checked the customer's environment, we found that the scheduled garbage cleanup had never completed properly and kept timing out.

*Troubleshooting:*

After checking, we found that KE was restarted on the night of April 5 and the customer's garbage cleanup was first triggered in the early morning of April 6. Once that cleanup was triggered, the query history deletion thread kept running from then on. As a result, subsequent scheduled garbage cleanup tasks could not complete.

Query history is deleted 2,000 rows at a time, and one of the customer's projects needs 550,000 query history records deleted. According to kylin.log, the deletes are slow because of table locking, and a single delete operation took more than 20 minutes.

The following log shows the main garbage cleanup thread waiting for the query history cleanup to finish. The cleanup does not finish, so the main thread times out and exits.

{code:shell}
2023-04-06T00:00:00,015 INFO [RoutineOpsWorker-287] service.ScheduleService : execute task MetadataBackup with remaining time: 14399995 ms
2023-04-06T00:01:52,649 INFO [RoutineOpsWorker-287] service.ScheduleService : execute task QueryHistoriesCleanup with remaining time: 14287361 ms
...
2023-04-06T04:00:00,012 WARN [DefaultTaskScheduler-3] service.ScheduleService : Routine task execution timeout
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_242]
    at org.apache.kylin.rest.service.ScheduleService.executeTask(ScheduleService.java:107) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
    at org.apache.kylin.rest.service.ScheduleService.routineTask(ScheduleService.java:77) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
    at org.apache.kylin.rest.service.ScheduleService$$FastClassBySpringCGLIB$$afbfc46c.invoke(<generated>) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
{code}

The following log runs up to the latest time covered by the provided logs. After 09:00 the query history deletion was still running; it did not stop when the main thread terminated.

{code:shell}
2023-04-06T00:08:43,015 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.selectByProject : <== Total: 12
2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project<CPIC_FRP> is less than the maximum limit, so skip it.
2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project<CXAIMA> is less than the maximum limit, so skip it.
2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project<CXCDC> is less than the maximum limit, so skip it.
2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project<CXCRMS> is less than the maximum limit, so skip it.
2023-04-06T00:08:43,017 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Start to delete query histories that are beyond max size for project<CXCZH>, records:1551669
...
2023-04-06T09:03:54,974 INFO [QueryHistoryCleanWorker-23145] query.JdbcQueryHistoryStore : Delete 2000 row query history for project [CXCZH] takes 938060 ms
2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==> Preparing: delete from ke4_instance_query_history_realization where query_time < ? and project_name = ?
2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==> Parameters: 1678863450091(Long), CXCZH(String)
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
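
The behaviour described in this issue (the waiting thread times out at FutureTask.get while the cleanup worker keeps deleting) can be reproduced in isolation. The following is a minimal sketch, not Kylin's ScheduleService code: the class name, method name, and the short sleep and timeout values are all hypothetical stand-ins. It only illustrates that a timed Future.get gives up waiting without stopping the task, and that the task stops early only if it is cancelled and responds to interruption.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; none of these names come from Kylin.
public class RoutineTimeoutSketch {

    /**
     * Submits a slow "cleanup" task, waits for it with a short timeout, and
     * reports whether the task is still running shortly after the wait timed out.
     */
    public static boolean workerAliveAfterTimeout(boolean cancelOnTimeout) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch finished = new CountDownLatch(1);
        try {
            Future<?> cleanup = worker.submit(() -> {
                try {
                    Thread.sleep(2_000); // stand-in for a slow batch delete
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // cancelled: stop early
                } finally {
                    finished.countDown();
                }
            });
            try {
                // The waiting thread gives up after 100 ms.
                cleanup.get(100, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // A timeout alone does not stop the task; cancel(true) interrupts it.
                if (cancelOnTimeout) {
                    cleanup.cancel(true);
                }
            }
            // True if the task has not finished within a further 300 ms.
            return !finished.await(300, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdownNow(); // keep the sketch from leaking threads
        }
    }

    public static void main(String[] args) {
        System.out.println("no cancel: worker still running = " + workerAliveAfterTimeout(false));
        System.out.println("cancel:    worker still running = " + workerAliveAfterTimeout(true));
    }
}
```

In this sketch the first call reports the worker as still running and the second does not. Whether cancelling on timeout is appropriate for the real cleanup task is a separate design question that this sketch does not answer, since an interrupted JDBC delete may or may not abort promptly depending on the driver.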