[ https://issues.apache.org/jira/browse/IMPALA-12997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841396#comment-17841396 ]
ASF subversion and git services commented on IMPALA-12997: ---------------------------------------------------------- Commit 712a37bce461c233bc9e648550e82696bef0f224 in impala's branch refs/heads/master from Michael Smith [ https://gitbox.apache.org/repos/asf?p=impala.git;h=712a37bce ] IMPALA-12997: Use graceful shutdown for query log tests Uses graceful shutdown for all tests that might insert into 'sys.impala_query_log' to avoid leaving the table locked in HMS by a SIGTERM. That's primarily any test that lowers 'query_log_write_interval_s' or 'query_log_max_queued'. Lowers grace period on test_query_log_table_flush_on_shutdown because ShutdownWorkloadManagement() is not started until the grace period ends. Updates "Adding/Removing local backend" to only apply to executors. It was only added for executors, but would be removed on dedicated coordinators as well (resulting in a DFATAL message during graceful shutdown). Change-Id: Ia123c53a952a77ff4a9c02736b5717ccaa3566dc Reviewed-on: http://gerrit.cloudera.org:8080/21345 Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> > test_query_log tests get stuck trying to write to the log > --------------------------------------------------------- > > Key: IMPALA-12997 > URL: https://issues.apache.org/jira/browse/IMPALA-12997 > Project: IMPALA > Issue Type: Bug > Components: Catalog > Affects Versions: Impala 4.4.0 > Reporter: Michael Smith > Assignee: Michael Smith > Priority: Major > > In some test runs, most tests under test_query_log will start to fail on > various conditions like > {code} > custom_cluster/test_query_log.py:452: in > test_query_log_table_query_select_mt_dop > "impala-server.completed-queries.written", 1, 60) > common/impala_service.py:144: in wait_for_metric_value > self.__metric_timeout_assert(metric_name, expected_value, timeout) > common/impala_service.py:213: in __metric_timeout_assert > assert 0, assert_string > E AssertionError: Metric impala-server.completed-queries.written did not > reach value 1 in 60s. > E Dumping debug webpages in JSON format... > E Dumped memz JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json > E Dumped metrics JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json > E Dumped queries JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json > E Dumped sessions JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json > E Dumped threadz JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json > E Dumped rpcz JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json > E Dumping minidumps for impalads/catalogds... > E Dumped minidump for Impalad PID 3680802 > E Dumped minidump for Impalad PID 3680805 > E Dumped minidump for Impalad PID 3680809 > E Dumped minidump for Catalogd PID 3680732 > {code} > or > {code} > custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls > assert len(sql_results.data) == 1, "query not found in completed queries > table" > E AssertionError: query not found in completed queries table > E assert 0 == 1 > E + where 0 = len([]) > E + where [] = <tests.common.impala_connection.ImpylaHS2ResultSet > object at 0xffffa00cc350>.data > {code} > One symptom that seems related to this is INSERT operations into > sys.impala_query_log that start "UnregisterQuery()" but never finish (with > "Query successfully unregistered"). > We can identify cases like that with > {code} > for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo > $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO > sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully > unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; > done; done > {code} > A similar case may occur with creating the table too > {code} > for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' > impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*); > do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO > sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query > successfully unregistered: query_id=$QID" $log; done > {code} > although these frequently fail because the test completes and shuts down > Impala before the CREATE TABLE query completes. > Tracking one of those cases led to catalogd errors that repeated for 1m27s > before the test suite restarted catalogd: > {code} > W0410 12:48:05.051760 3681790 Tasks.java:456] > 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for > lock on table sys.impala_query_log > Java exception follows: > org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for > lock on table sys.impala_query_log > at > org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217) > at > org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413) > at > org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) > at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) > at > org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209) > at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146) > at > org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194) > at > org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135) > at > org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417) > at > org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413) > at > org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) > at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) > at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196) > at > org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413) > at > org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939) > at > org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259) > at > org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503) > {code} > This seems to correspond to the first test failure and persisted through all > tests afterwards. So this lock was likely acquired from HMS, then never > released. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org