[ https://issues.apache.org/jira/browse/IMPALA-12997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836322#comment-17836322 ]
Michael Smith commented on IMPALA-12997: ---------------------------------------- [~stigahuang] any idea what to look at next? > test_query_log tests get stuck trying to write to the log > --------------------------------------------------------- > > Key: IMPALA-12997 > URL: https://issues.apache.org/jira/browse/IMPALA-12997 > Project: IMPALA > Issue Type: Bug > Components: Catalog > Affects Versions: Impala 4.4.0 > Reporter: Michael Smith > Priority: Major > > In some test runs, most tests under test_query_log will start to fail on > various conditions like > {code} > custom_cluster/test_query_log.py:452: in > test_query_log_table_query_select_mt_dop > "impala-server.completed-queries.written", 1, 60) > common/impala_service.py:144: in wait_for_metric_value > self.__metric_timeout_assert(metric_name, expected_value, timeout) > common/impala_service.py:213: in __metric_timeout_assert > assert 0, assert_string > E AssertionError: Metric impala-server.completed-queries.written did not > reach value 1 in 60s. > E Dumping debug webpages in JSON format... > E Dumped memz JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json > E Dumped metrics JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json > E Dumped queries JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json > E Dumped sessions JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json > E Dumped threadz JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json > E Dumped rpcz JSON to > $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json > E Dumping minidumps for impalads/catalogds... > E Dumped minidump for Impalad PID 3680802 > E Dumped minidump for Impalad PID 3680805 > E Dumped minidump for Impalad PID 3680809 > E Dumped minidump for Catalogd PID 3680732 > {code} > or > {code} > custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls > assert len(sql_results.data) == 1, "query not found in completed queries > table" > E AssertionError: query not found in completed queries table > E assert 0 == 1 > E + where 0 = len([]) > E + where [] = <tests.common.impala_connection.ImpylaHS2ResultSet > object at 0xffffa00cc350>.data > {code} > One symptom that seems related to this is INSERT operations into > sys.impala_query_log that start "UnregisterQuery()" but never finish (with > "Query successfully unregistered"). > We can identify cases like that with > {code} > for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo > $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO > sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully > unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; > done; done > {code} > A similar case may occur with creating the table too > {code} > for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' > impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*); > do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO > sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query > successfully unregistered: query_id=$QID" $log; done > {code} > although these frequently fail because the test completes and shuts down > Impala before the CREATE TABLE query completes. > Tracking one of those cases led to catalogd errors that repeated for 1m27s > before the test suite restarted catalogd: > {code} > W0410 12:48:05.051760 3681790 Tasks.java:456] > 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for > lock on table sys.impala_query_log > Java exception follows: > org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for > lock on table sys.impala_query_log > at > org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217) > at > org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413) > at > org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) > at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) > at > org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209) > at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146) > at > org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194) > at > org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135) > at > org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417) > at > org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413) > at > org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) > at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) > at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196) > at > org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413) > at > org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939) > at > org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259) > at > org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503) > {code} > This seems to correspond to the first test failure and persisted through all > tests afterwards. So this lock was likely acquired from HMS, then never > released. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org