github-actions[bot] commented on code in PR #60812:
URL: https://github.com/apache/doris/pull/60812#discussion_r2882336531


##########
be/src/runtime/fragment_mgr.cpp:
##########
@@ -674,6 +678,18 @@ void FragmentMgr::remove_pipeline_context(std::pair<TUniqueId, int> key) {
 }
 
 void FragmentMgr::remove_query_context(const TUniqueId& key) {
+    // Clean up any saved rerunnable params for this query to avoid memory leaks.
+    // This covers both cancel and normal destruction paths.
+    {
+        std::lock_guard<std::mutex> lk(_rerunnable_params_lock);
+        for (auto it = _rerunnable_params_map.begin(); it != _rerunnable_params_map.end();) {
+            if (it->first.first == key) {
+                it = _rerunnable_params_map.erase(it);
+            } else {
+                ++it;

Review Comment:
   **Bug: Potential use-after-free of raw pointer**
   
   `query_ctx` is a raw pointer obtained from `fragment_ctx->get_query_ctx()`. 
After `fragment_ctx.reset()` (line 1497) and `remove_pipeline_context()` (line 
1498) drop the PFC's `shared_ptr<QueryContext>`, the `QueryContext` lifetime 
depends on other owners — primarily the `shared_ptr<QueryContext>` in 
`_rerunnable_params_map` and `_query_ctx_map_delay_delete`.
   
   If a concurrent `remove_query_context()` (triggered by query cancellation) 
runs at this point, it erases the `_rerunnable_params_map` entries (lines 
683-691) and the `_query_ctx_map_delay_delete` entry (line 693), potentially 
destroying the `QueryContext` while this raw pointer is still in use (in the 
lambda at line 1505).
   
   Suggested fix: Use a `shared_ptr<QueryContext>` to keep the `QueryContext` 
alive through this scope, similar to how the `recreate_and_submit` stage does 
it:
   ```cpp
   auto q_ctx = get_query_ctx(query_id);
   if (!q_ctx) {
       return Status::NotFound(...);
   }
   SCOPED_ATTACH_TASK(q_ctx.get());
   ```
   Or extract it from the PFC before resetting:
   ```cpp
   auto query_ctx = fragment_ctx->shared_query_ctx(); // if such accessor exists
   ```
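   The lifetime hazard and the `shared_ptr` fix can be sketched in isolation. This is a minimal, self-contained illustration; `QueryContext`, `Registry`, and all member names here are hypothetical stand-ins, not the actual Doris types:
   ```cpp
   #include <cassert>
   #include <memory>
   #include <string>

   // Hypothetical stand-in for QueryContext (illustrative only).
   struct QueryContext {
       std::string query_id;
   };

   // Models a registry (like FragmentMgr's maps) that may drop its
   // reference at any time, e.g. a concurrent remove_query_context().
   struct Registry {
       std::shared_ptr<QueryContext> ctx = std::make_shared<QueryContext>();

       // Unsafe: caller gets a raw pointer with no say in the lifetime.
       QueryContext* get_raw() { return ctx.get(); }

       // Safe: caller shares ownership, keeping the object alive in scope.
       std::shared_ptr<QueryContext> get_shared() { return ctx; }

       // Simulated concurrent cleanup path dropping the registry's reference.
       void remove() { ctx.reset(); }
   };

   int main() {
       Registry reg;
       reg.ctx->query_id = "q1";

       // Safe pattern: take shared ownership *before* cleanup can run.
       std::shared_ptr<QueryContext> q_ctx = reg.get_shared();
       reg.remove();  // concurrent remove_query_context() analogue

       // q_ctx still owns the object, so this access is well-defined.
       // With get_raw() instead, the same access would be a use-after-free.
       assert(q_ctx->query_id == "q1");
       assert(q_ctx.use_count() == 1);
       return 0;
   }
   ```
   The point is only the ownership order: copy the `shared_ptr` into the working scope first, and any concurrent erase from the maps merely drops one reference instead of destroying the object.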



##########
gensrc/proto/internal_service.proto:
##########
@@ -75,11 +75,10 @@ message PTransmitRecCTEBlockResult {
 
 message PRerunFragmentParams {
     enum Opcode {
-    wait = 1;    // wait fragment execute done
-    release = 2; // release current round's resource
-    rebuild = 3; // rebuild next round pipeline tasks
-    submit = 4;  // submit tasks to execute
-    close = 5;   // close fragment
+    wait_for_close = 1;       // wait for all tasks to close, without sending report or removing PFC
+    wait_for_destroy = 2;     // deregister RF, destroy old PFC, wait for external threads to release refs

Review Comment:
   **Compatibility: Enum value remapping may break rolling upgrades**
   
   The old enum values on master are:
   ```
   wait = 1; release = 2; rebuild = 3; submit = 4; close = 5;
   ```
   The new values reuse 1-4 with different semantics:
   ```
   wait_for_close = 1; wait_for_destroy = 2; recreate_and_submit = 3; 
final_close = 4;
   ```
   
   During a rolling upgrade of a Doris cluster, a node running old code 
receiving `wait_for_destroy=2` would interpret it as `release=2`, leading to 
incorrect behavior. Similarly, `recreate_and_submit=3` would be interpreted as 
`rebuild=3`.
   
   If recursive CTE is used during rolling upgrades, this could cause subtle 
failures. Consider:
   1. Using higher enum values (e.g., starting at 10) to avoid overlap, or
   2. Adding a version/compatibility check, or
   3. Documenting that recursive CTE queries should not be run during rolling 
upgrades.
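   The failure mode can be shown with a small sketch. The two C++ enums below model the proto `Opcode` as seen by an old and a new node (the real types live in the protobuf-generated code, and `old_node_interprets` is a hypothetical dispatcher, not Doris code):
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <string>

   // Wire values as understood by a node running the old master code.
   enum class OldOpcode : int32_t {
       wait = 1, release = 2, rebuild = 3, submit = 4, close = 5
   };

   // Wire values as understood by a node running this PR's code.
   enum class NewOpcode : int32_t {
       wait_for_close = 1, wait_for_destroy = 2,
       recreate_and_submit = 3, final_close = 4
   };

   // How an old node dispatches on a raw wire value it receives.
   std::string old_node_interprets(int32_t wire_value) {
       switch (static_cast<OldOpcode>(wire_value)) {
           case OldOpcode::wait:    return "wait";
           case OldOpcode::release: return "release";
           case OldOpcode::rebuild: return "rebuild";
           case OldOpcode::submit:  return "submit";
           case OldOpcode::close:   return "close";
           default:                 return "unknown";
       }
   }

   int main() {
       // A new node sends wait_for_destroy (= 2); the old node silently
       // dispatches it as release, with different semantics.
       int32_t wire = static_cast<int32_t>(NewOpcode::wait_for_destroy);
       assert(old_node_interprets(wire) == "release");

       // Non-overlapping values (e.g. starting at 10) would instead fall
       // through to "unknown", which can at least be detected and rejected.
       assert(old_node_interprets(10) == "unknown");
       return 0;
   }
   ```
   This is why option 1 (non-overlapping values) is attractive: an out-of-range value is detectable on the old side, whereas a reused value is silently misinterpreted.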



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

