[jira] [Commented] (IMPALA-12997) test_query_log tests get stuck trying to write to the log

ASF subversion and git services (Jira) Fri, 26 Apr 2024 21:11:06 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-12997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841396#comment-17841396
 ]


ASF subversion and git services commented on IMPALA-12997:
----------------------------------------------------------

Commit 712a37bce461c233bc9e648550e82696bef0f224 in impala's branch 
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=712a37bce ]

IMPALA-12997: Use graceful shutdown for query log tests

Uses graceful shutdown for all tests that might insert into
'sys.impala_query_log' to avoid leaving the table locked in HMS by a
SIGTERM. That's primarily any test that lowers
'query_log_write_interval_s' or 'query_log_max_queued'.

Lowers grace period on test_query_log_table_flush_on_shutdown because
ShutdownWorkloadManagement() is not started until the grace period ends.

Updates "Adding/Removing local backend" to only apply to executors. It
was only added for executors, but would be removed on dedicated
coordinators as well (resulting in a DFATAL message during graceful
shutdown).

Change-Id: Ia123c53a952a77ff4a9c02736b5717ccaa3566dc
Reviewed-on: http://gerrit.cloudera.org:8080/21345
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> test_query_log tests get stuck trying to write to the log
> ---------------------------------------------------------
>
>                 Key: IMPALA-12997
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12997
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 4.4.0
>            Reporter: Michael Smith
>            Assignee: Michael Smith
>            Priority: Major
>
> In some test runs, most tests under test_query_log will start to fail on 
> various conditions like
> {code}
> custom_cluster/test_query_log.py:452: in 
> test_query_log_table_query_select_mt_dop
>     "impala-server.completed-queries.written", 1, 60)
> common/impala_service.py:144: in wait_for_metric_value
>     self.__metric_timeout_assert(metric_name, expected_value, timeout)
> common/impala_service.py:213: in __metric_timeout_assert
>     assert 0, assert_string
> E   AssertionError: Metric impala-server.completed-queries.written did not 
> reach value 1 in 60s.
> E   Dumping debug webpages in JSON format...
> E   Dumped memz JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json
> E   Dumped metrics JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json
> E   Dumped queries JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json
> E   Dumped sessions JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json
> E   Dumped threadz JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json
> E   Dumped rpcz JSON to 
> $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json
> E   Dumping minidumps for impalads/catalogds...
> E   Dumped minidump for Impalad PID 3680802
> E   Dumped minidump for Impalad PID 3680805
> E   Dumped minidump for Impalad PID 3680809
> E   Dumped minidump for Catalogd PID 3680732
> {code}
> or
> {code}
> custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls
>     assert len(sql_results.data) == 1, "query not found in completed queries 
> table"
> E   AssertionError: query not found in completed queries table
> E   assert 0 == 1
> E    +  where 0 = len([])
> E    +    where [] = <tests.common.impala_connection.ImpylaHS2ResultSet 
> object at 0xffffa00cc350>.data
> {code}
> One symptom that seems related to this is INSERT operations into 
> sys.impala_query_log that start "UnregisterQuery()" but never finish (with 
> "Query successfully unregistered").
> We can identify cases like that with
> {code}
> for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo 
> $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO 
> sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully 
> unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; 
> done; done
> {code}
> A similar case may occur with creating the table too
> {code}
> for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' 
> impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*);
>  do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO 
> sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query 
> successfully unregistered: query_id=$QID" $log; done
> {code}
> although these frequently fail because the test completes and shuts down 
> Impala before the CREATE TABLE query completes.
> Tracking one of those cases led to catalogd errors that repeated for 1m27s 
> before the test suite restarted catalogd:
> {code}
> W0410 12:48:05.051760 3681790 Tasks.java:456] 
> 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for 
> lock on table sys.impala_query_log
> Java exception follows:
> org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for 
> lock on table sys.impala_query_log
>       at 
> org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
>       at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
>       at 
> org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209)
>       at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146)
>       at 
> org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194)
>       at 
> org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
>       at 
> org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
>       at 
> org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
>       at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
>       at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
>       at 
> org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413)
>       at 
> org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308)
>       at 
> org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220)
>       at 
> org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939)
>       at 
> org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504)
>       at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>       at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>       at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>       at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
>       at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245)
>       at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259)
>       at 
> org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
> {code}
> This seems to correspond to the first test failure and persisted through all 
> tests afterwards. So this lock was likely acquired from HMS, then never 
> released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Commented] (IMPALA-12997) test_query_log tests get stuck trying to write to the log

Reply via email to