Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16952 )
Change subject: KUDU-2612: background task to commit transaction ...................................................................... Patch Set 8: (11 comments) http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc File src/kudu/integration-tests/txn_commit-itest.cc: http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@302 PS7, Line 302: DeleteTable > Does it work as expected as well if dropping a tablet (just a part of a tab Done http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@316 PS7, Line 316: TestCommitAfterDroppingRangeP > I think it would be great to add a scenario when TxnStatusManager instance Yeah, I can add that as a follow-up if you dont mind it being in a separate patch. Pausing is tricky with my current test harnesses, so I'd like to add some additional coverage using an EMC, per your other suggestions. http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@565 PS7, Line 565: for (const auto& s : results) { : EXPECT_OK(s); : } > Is this going to put the txn keepalive thread to sleep as well? I'm not su Ah indeed. Done http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@690 PS7, Line 690: // Shut down our par > I guess a real 'abrupt' way of terminating the process is sending SIGKILL : Yeah, calling Shutdown() from the mini server helps exercise any potential races in the destructor, but SIGKILL would be more organic. I'll add that as a follow-up change. http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@754 PS7, Line 754: } > It would be great to add at least one failure scenario, when finalizing the That case is very similar to the TestCommitWhenParticipantsAreDown case, though less trivial to implement with 1, 2, and 3 node clusters. I can add that as a follow-up. http://gerrit.cloudera.org:8080/#/c/16952/4/src/kudu/transactions/txn_status_manager.cc File src/kudu/transactions/txn_status_manager.cc: http://gerrit.cloudera.org:8080/#/c/16952/4/src/kudu/transactions/txn_status_manager.cc@152 PS4, Line 152: > Ah, indeed. It seems I misunderstood where the status comes from (even if Added a comment http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc File src/kudu/transactions/txn_status_manager.cc: http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@201 PS7, Line 201: if (PREDICT_FALSE(s.IsTimedOut())) { : // Retry timeout errors. Other transient errors should be retried by the : // client until timeout. > After some consideration, I'm not quite sure this 'good-riddance' semantic The case we're thinking about here is if a transaction is racing with a drop partition. The observation I'm relying on here is that the result of deleting the partition after committing and deleting the partition before committing are the same -- the end result is the partition is deleted with no way of reading the data. And if the table were deleted while inserting data, it's the onus of the writing application to detect that, since there'd be NotFound errors well before the commit phase. It's also important to pass through this case rather than aborting for the following case: 1. begin committing a transaction with participants 1..5, running the BEGIN_COMMIT op on all participants 2. finalize commit on participants 1..5, running the FINALIZE_COMMIT op on all participants 3. leadership changes or the node crashes, a new leader is elected and the commit task is retried 4. partition 5 is deleted 5. we notice that partition 5 is deleted. Should we abort the transaction? But we've already finalized the commit on some participants! I added a comment trying to explain that. http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@286 PS7, Line 286: txn_client_->ParticipateInTransactionAsync( : participant_ids_[participant_idx], > nit: just for the sake of working with std::atomic in a consistent way, may Done http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@463 PS7, Line 463: > Should this be a parameter controlled by a run-time flag? Done http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@471 PS7, Line 471: now, just > nit: maybe, move this outside of the 'for ()' cycle? Done http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@492 PS7, Line 492: { : std::lock_guard<simple_spinlock> l(lock_); : highest_txn_id_ = std::max(highest_txn_id, highest_txn_id_); : > Why not to do so prior to starting new commit tasks? I guess it might be l Done -- To view, visit http://gerrit.cloudera.org:8080/16952 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie2258dded3ab3d527cb5d0abdc7d5e7deb4da15e Gerrit-Change-Number: 16952 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Tue, 02 Feb 2021 20:28:57 +0000 Gerrit-HasComments: Yes
