AntiTopQuark commented on PR #61199:
URL: https://github.com/apache/doris/pull/61199#issuecomment-4163380068

   > ## Code Review
   > Thanks for this substantial feature addition! The TSO design follows 
established patterns (similar to TiDB/PD). Here are my findings:
   > 
   > ### Critical Issues
   > **1. Config naming inconsistency — `enable_tso_feature` vs 
`experimental_enable_feature_tso`**
   > 
   > The Config field is `enable_tso_feature` but the error message in 
`PropertyAnalyzer.analyzeEnableTso()` says `"experimental_enable_tso_feature"` 
(note the word order difference: `enable_tso_feature` vs `enable_feature_tso`). 
This will confuse users trying to set the right config.
   > 
   > **2. `getTSO()` sleeps in the transaction commit path**
   > 
   > `getTSO()` is called from `DatabaseTransactionMgr.getCommitTSO()` during 
commit. It contains retry loops that call `sleep(200)` up to 
`tso_max_get_retry_count` (default 10) times. This means a **transaction commit 
can block for up to 2 seconds** waiting for the environment. The 
`Env.getCurrentEnv().isReady()` and `isMaster()` checks should be done once 
before entering the retry loop rather than repeated inside it.
   > 
   > **3. Logical counter overflow recovery gap**
   > 
   > In `getTSO()`, when `logical > MAX_LOGICAL_COUNTER`, the code sleeps and 
retries. But `generateTSO()` returns `logicalCounter + 1` without resetting the 
counter — the reset only happens when `updateTimestamp()` daemon advances the 
physical time via `setTSOPhysical()`. If the daemon tick hasn't run yet, all 
concurrent `getTSO()` calls will see overflow and fail until the daemon catches 
up.
   > 
   > **4. `unprotectedCommitTransaction` signature change**
   > 
   > `unprotectedCommitTransaction` and `unprotectedCommitTransaction2PC` now 
throw `TransactionCommitFailedException`. This is a behavioral change — all 
callers need to be verified. Subclasses (e.g. cloud variants) that override 
these methods must also update their signatures.
   > 
   > **5. `calibrateTimestamp()` requires `enable_tso_persist_journal=true` but 
it defaults to `false`**
   > 
   > Enabling `enable_tso_feature=true` without also setting 
`enable_tso_persist_journal=true` causes the service to perpetually fail 
calibration in `runAfterCatalogReady()`. These config dependencies should be 
enforced or at least clearly documented.
   > 
   > ### Moderate Issues
   > **6. `_async_publish_tasks` still uses anonymous `std::tuple`**
   > 
   > `storage_engine.h` changed the async publish map from `pair` to 
`tuple<int64_t, int64_t, int64_t>` with positional access (`std::get<0>`, 
etc.). Since `DiscontinuousVersionTablet` was correctly introduced as a named 
struct for the other case, consider doing the same here for consistency.
   > 
   > **7. `Rowset::make_visible` has no default for `commit_tso`**
   > 
   > The signature changed from `make_visible(Version)` to 
`make_visible(Version, int64_t commit_tso)` without a default parameter. The BE 
`txn_manager.h` overloads correctly use `= -1`, but `rowset.h` does not. Any 
code (including out-of-tree or test code) calling the old signature will break.
   > 
   > **8. `/api/tso` authentication check**
   > 
   > `TSOAction.getTSO()` calls `executeCheckPassword(request, response)` but 
doesn't check its return value. Other REST actions in the codebase typically 
verify the result or use `checkGlobalAuth`. This may allow unauthenticated 
access to TSO state.
   > 
   > **9. `TSOClockBackwardException` propagates through `MasterDaemon`**
   > 
   > If clock backward exceeds the threshold, a `RuntimeException` propagates 
from `runAfterCatalogReady()`. Depending on how `MasterDaemon` handles uncaught 
exceptions in the run loop, this could kill the daemon thread permanently.
   > 
   > **10. Concurrency model mixes `ReentrantLock` and 
`AtomicLong`/`AtomicBoolean`**
   > 
   > `windowEndTSO` is an `AtomicLong` read without the lock in 
`updateTimestamp()`, while `isInitialized` is an `AtomicBoolean` set under lock 
in some paths but checked without it in others. This is technically safe but 
makes the concurrency contract harder to reason about. Consider documenting 
which fields are protected by what.
   > 
   > ### Minor Issues
   > **11. Missing spaces in config descriptions**
   > 
   > Several `@ConfField` description strings are missing trailing spaces 
before concatenation:
   > 
   > * `"retry 3 times" + "to update"` → `"retry 3 timesto update"`
   > * `"5000ms from BDBJE once."` → should have space before `"will apply"`
   > * Same pattern in `tso_max_get_retry_count`, 
`tso_service_window_duration_ms`, `tso_time_offset_debug_mode`
   > 
   > **12. `FeMetaVersion` bump mentioned but not in diff**
   > 
   > The PR description mentions bumping `VERSION_CURRENT` to `VERSION_141`, 
but this file is not in the changed files list. Is the version bump missing or 
in a separate PR?
   > 
   > **13. Regression test may silently pass when TSO is disabled**
   > 
   > `test_tso_api.groovy` and `test_tso_rowset_commit_tso.groovy` — if the 
test cluster doesn't have `enable_tso_feature` and `enable_tso_persist_journal` 
set, the tests may skip or pass without exercising the feature.
   > 
   > Overall the BE-side plumbing (threading `commit_tso` through publish, 
rowset meta, information_schema) is clean. The FE core needs attention on the 
concurrency/retry model in `getTSO()` and the config dependency chain.
   
   1. Config naming inconsistency
   `enable_tso_feature` is the code field name, while 
`experimental_enable_tso_feature` is the external config name for the 
experimental feature. The `experimental` prefix follows the system's 
established convention.
   
   2. `getTSO()` contains sleep in the transaction commit path
   This is an intentional bounded retry, designed to handle transient states 
during master‑slave failover or readiness flutter. Moving the checks outside 
the loop would turn recoverable transient errors into immediate failures. The 
maximum blocking time is also configurable, not unbounded.
   
   3. Logical counter overflow "recovery gap"
   This is by design as backpressure, not a bug: on overflow, we wait for the 
daemon to advance physical time to guarantee monotonicity and persistence 
semantics. Reset and persistence are not performed on the hot `getTSO()` path.
   
   4. `unprotectedCommitTransaction*` signature change
   This is an explicit upgrade to failure semantics (abort commit if TSO 
acquisition fails). All in‑tree call chains have been aligned, and I have not 
found any overridden methods that are not adapted.
   
   5. `calibrateTimestamp()` requires `enable_tso_persist_journal=true`
   This is an intentional fail‑fast constraint to prevent unsafe execution such 
as “TSO enabled without persistence”. The default value being false is part of 
a gradual rollout strategy, not a defect.
   
   6. `_async_publish_tasks` uses tuple
   This minimizes code changes, and relevant comments have already been added 
in the code.
   
   7. `Rowset::make_visible` has no default parameter
   All in‑tree call sites now explicitly pass `commit_tso`, so there is no 
issue causing existing code to fail at compile or runtime.
   
   8. `/api/tso` missing authentication check
   The semantics of `executeCheckPassword()` are to throw an exception directly 
on failure, rather than relying on return value checking. Many other REST 
interfaces in the codebase follow the same pattern.
   
   9. `TSOClockBackwardException` causes daemon thread exit
   This will not happen. `Daemon.run()` catches `Throwable` at the top level, 
so the thread proceeds to the next round and is not permanently terminated by 
this exception.
   
   10. Mixed use of `ReentrantLock` and `Atomic*`
   This is intentional layering: high‑frequency read‑only states use atomic 
variables, while compound state such as `globalTimestamp` uses locks. The 
thread‑safety model is clear and correct.
   
   11. Spaces in config description strings
   This has been fixed.
   
   12. FeMetaVersion bump mentioned but not present in code
   During earlier code review, it was determined that the FeMetaVersion change 
was unnecessary. That part of the code has been removed, and the description 
updated accordingly.
   
   13. Regression tests “silently pass” when TSO is disabled
   Both regression tests explicitly run `ADMIN SET FRONTEND CONFIG` at the 
beginning to enable `experimental_enable_tso_feature` and 
`enable_tso_persist_journal`. They do not run and pass vacuously with TSO 
disabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to