Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17872 )
Change subject: IMPALA-10811 RPC to submit query getting stuck for AWS NLB forever ...................................................................... Patch Set 19: (11 comments) http://gerrit.cloudera.org:8080/#/c/17872/15/be/src/service/client-request-state.cc File be/src/service/client-request-state.cc: http://gerrit.cloudera.org:8080/#/c/17872/15/be/src/service/client-request-state.cc@720 PS15, Line 720: : // Set the results to be reported to the client. : SetResultSet(catalog_op_executor_->ddl_exec_response()); : return Status::OK(); : } : : Status ClientRequestState::ExecAsyncDdlRequest() { : string op_type = catalog_op_type() == TCatalogOpType::DDL ? : > Decided to keep this particular case (CREATE_TABLE_AS_SELECT) to run (the e I hacked a bit on this, and the modifications are not that big. This is what that would look like: http://gerrit.cloudera.org:8080/17916 http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc File be/src/service/client-request-state.cc: http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc@643 PS17, Line 643: SyncDdl Nit: We have an unrelated feature called "sync_ddl". If we can avoid similar names, we should. One naming scheme would be: ExecDdlRequest() - entrypoint that picks between sync/async ShouldRunDdlAsync() - decision to go sync/async ExecDdlRequestInt() - synchronous version ExecDdlRequestIntAsync() - asynchronous version http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc@688 PS17, Line 688: Status status = catalog_op_executor_->Exec(exec_request_->catalog_op_request); : { Same as other DebugAction location. Same fixes. http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc@728 PS17, Line 728: For this async code, I think we are better off with a void return value, because nothing is actually consuming the return value. Status would be conveyed through UpdateQueryStatus(). http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc@735 PS17, Line 735: Status status = catalog_op_executor_->Exec(exec_request_->catalog_op_request); : { For here, I think you can use DebugActionNoFail(), since we are using this to add sleeps rather than actually failing. Separately, a small nit: let's make the location strings similar to our other location strings in this file. Something like "CRS_BEFORE_CATALOG_OP_EXEC" http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc@744 PS17, Line 744: exec_request_->query_options.sync_ddl)); : : // Set the results to be reported to the cli An important thing here is that nothing is actually using the return value of this function. So, I think we need to change this to use UpdateQueryStatus() Status catalog_update_status = parent_server_->ProcessCatalogUpdateResult(...) { lock_guard<mutex> l(lock_); RETURN_IF_ERROR(UpdateQueryStatus(catalog_update_status)); } http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc@778 PS17, Line 778: } : UpdateNonErrorExecState(ExecState::RUNNING); Please go look at the comment that I made on the previous upload about ABORT_IF_ERROR. Separately, we don't need to be holding the lock when executing this. http://gerrit.cloudera.org:8080/#/c/17872/17/be/src/service/client-request-state.cc@781 PS17, Line 781: } Do this before spawning the thread (similar to how ExecAsyncQueryOrDmlRequest() does). http://gerrit.cloudera.org:8080/#/c/17872/19/testdata/workloads/functional-query/queries/QueryTest/alter-table-recover.test File testdata/workloads/functional-query/queries/QueryTest/alter-table-recover.test: http://gerrit.cloudera.org:8080/#/c/17872/19/testdata/workloads/functional-query/queries/QueryTest/alter-table-recover.test@7 PS19, Line 7: drop table if exists alltypes_clone; : create external table alltypes_clone like functional_parquet.alltypes : location '$FILESYSTEM_PREFIX/test-warehouse/alltypes_parquet'; : set debug_action="TIMED_WAIT_BEFORE_CATALOG_OP_EXEC:SLEEP@15000"; : alter table alltypes_clone recover partitions; Ideally, a test would fail if you ran it without the corresponding code changes. Timing changes are harder to test, because we aren't really changing the behavior of a SQL statement. If you were doing the individual exec / getoperationstatus calls, you could verify that the exec is fast even with TIMED_WAIT_BEFORE_CATALOG_OP_EXEC:SLEEP@15000. I think the HS2 tests can test this. This test seems to have useful pieces: https://github.com/apache/impala/blob/master/tests/hs2/test_hs2.py#L282 http://gerrit.cloudera.org:8080/#/c/17872/19/tests/metadata/test_ddl.py File tests/metadata/test_ddl.py: http://gerrit.cloudera.org:8080/#/c/17872/19/tests/metadata/test_ddl.py@913 PS19, Line 913: cls.ImpalaTestMatrix.add_dimension(create_client_protocol_dimension()) : cls.ImpalaTestMatrix.add_constraint(lambda v: : v.get_value('protocol') == 'hs2') Test dimensions run tests multiple times with different configurations. If you remove the add_constraint(), it would run this test with each of hs2, beeswax, and hs2-http. That seems reasonable, and then you don't need TestAlterTableRecoverWithBeeswax. http://gerrit.cloudera.org:8080/#/c/17872/19/tests/metadata/test_ddl.py@921 PS19, Line 921: multiple_impalad=self._use_multiple_impalad(vector) You can omit multiple_impalad. When sync_ddl=0, _use_multiple_impalad is false, and that is the default value. -- To view, visit http://gerrit.cloudera.org:8080/17872 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib57e86926a233ef13d27a9ec8d9c36d33a88a44e Gerrit-Change-Number: 17872 Gerrit-PatchSet: 19 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Comment-Date: Tue, 12 Oct 2021 02:23:08 +0000 Gerrit-HasComments: Yes
