CRZbulabula opened a new pull request, #17754:
URL: https://github.com/apache/iotdb/pull/17754
## Summary
- **Root cause**: `QueryExecution.retry()` re-plans the query via
`doDistributionPlan()`, which creates new `PlanFragmentId` objects with
`nextFragmentInstanceId` reset to 0. Since the `queryId` is unchanged, the
retry generates fragment instance IDs **identical** to the first execution
(e.g. `queryId_11.0`). `FragmentInstanceManager.instanceContext` retains
completed contexts for up to 5 minutes for statistics caching. When the retry
dispatches the same fragment instance (FI) ID, `instanceContext.computeIfAbsent()`
returns the **stale old context**, whose `releaseResource()` has already run and
set `dataRegion` to `null`. New drivers then throw an NPE at
`dataRegion.tryReadLock()` inside `FragmentInstanceContext.initQueryDataSource()`.
- **Primary fix** (`FragmentInstanceManager.java`): Replace
`instanceContext.computeIfAbsent()` with `instanceContext.compute()` inside
`execDataQueryFragmentInstance()`. The compute function atomically replaces any
existing context whose `dataRegion` is `null` (i.e. already released) with a
fresh context carrying the new `dataRegion` reference.
- **Defensive fix** (`FragmentInstanceContext.java`): Add a `dataRegion ==
null` guard at the top of `getSharedQueryDataSource()`. If triggered, return
`null` (which `DataDriver.initialize()` already handles as an aborted FI)
instead of propagating an NPE.
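
The difference between the two map operations in the primary fix can be illustrated with a minimal, self-contained sketch. This is not the actual IoTDB code: `Context`, `isReleased()`, `getOrCreateOld()` and `getOrCreateFixed()` are hypothetical stand-ins, and the real `FragmentInstanceContext` carries far more state. It only assumes, as described above, that a released context has a `null` `dataRegion`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class StaleContextSketch {
    /** Hypothetical stand-in for FragmentInstanceContext; models only the dataRegion field. */
    static final class Context {
        private volatile Object dataRegion;

        Context(Object dataRegion) {
            this.dataRegion = dataRegion;
        }

        /** Mirrors the effect described in the PR: releasing clears the dataRegion reference. */
        void releaseResource() {
            dataRegion = null;
        }

        boolean isReleased() {
            return dataRegion == null;
        }
    }

    /** Hypothetical stand-in for FragmentInstanceManager.instanceContext. */
    static final ConcurrentMap<String, Context> instanceContext = new ConcurrentHashMap<>();

    /** Old behaviour: any existing entry wins, even one that was already released. */
    static Context getOrCreateOld(String id, Object region) {
        return instanceContext.computeIfAbsent(id, k -> new Context(region));
    }

    /** Fixed behaviour: a missing or released entry is atomically replaced by a fresh one. */
    static Context getOrCreateFixed(String id, Object region) {
        return instanceContext.compute(
            id,
            (k, existing) ->
                (existing == null || existing.isReleased()) ? new Context(region) : existing);
    }

    public static void main(String[] args) {
        String id = "queryId_11.0";
        Context first = getOrCreateOld(id, new Object());
        first.releaseResource(); // first execution finished; entry is still cached

        Context viaOld = getOrCreateOld(id, new Object());
        System.out.println("computeIfAbsent returned stale context: " + (viaOld == first));

        Context viaFixed = getOrCreateFixed(id, new Object());
        System.out.println("compute returned usable context: " + !viaFixed.isReleased());
    }
}
```

Because `ConcurrentHashMap.compute()` runs the remapping function atomically per key, the check and the replacement cannot interleave with another dispatch of the same ID, and a context that is still live is returned unchanged.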
## Reproduction
The bug is reliably reproduced when:
1. A data query fails and `QueryExecution.retry()` is triggered (max 3
retries, 2 s apart).
2. The first execution's `FragmentInstanceContext` has been released but not
yet evicted from `instanceContext`.
3. The retry dispatches a fragment instance with the same ID, gets the stale
context back from `computeIfAbsent`, and the driver NPEs.
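
Why the retry lands on the same ID can be modelled with a short sketch. It is a simplified illustration, not the IoTDB implementation: `FragmentId`, `planOnce()` and the `queryId_<fragment>.<instance>` string format are assumptions modelled on the `queryId_11.0` example above. The only behaviour taken from the PR is that each planning pass builds new fragment ID objects whose instance counter starts at 0.

```java
import java.util.concurrent.atomic.AtomicInteger;

class FragmentIdCollisionSketch {
    /** Hypothetical stand-in for PlanFragmentId: each object owns its own instance counter. */
    static final class FragmentId {
        private final String queryId;
        private final int fragmentIndex;
        private final AtomicInteger nextInstanceId = new AtomicInteger(0);

        FragmentId(String queryId, int fragmentIndex) {
            this.queryId = queryId;
            this.fragmentIndex = fragmentIndex;
        }

        /** Assumed ID format, modelled on the "queryId_11.0" example. */
        String nextFragmentInstanceId() {
            return queryId + "_" + fragmentIndex + "." + nextInstanceId.getAndIncrement();
        }
    }

    /** Models one planning pass: a fresh FragmentId, so the counter restarts at 0. */
    static String planOnce(String queryId, int fragmentIndex) {
        return new FragmentId(queryId, fragmentIndex).nextFragmentInstanceId();
    }

    public static void main(String[] args) {
        String first = planOnce("queryId", 11); // initial execution
        String retry = planOnce("queryId", 11); // retry with the same queryId
        System.out.println(first + " vs " + retry + " -> identical: " + first.equals(retry));
    }
}
```

Since the `queryId` is unchanged and the counter restarts, both passes produce the same key, which is what lets the retry look up the first execution's cached context.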
Observed stack trace:
```
java.lang.NullPointerException: Cannot invoke "IDataRegionForQuery.tryReadLock(long)" because "this.dataRegion" is null
    at FragmentInstanceContext.initQueryDataSource(FragmentInstanceContext.java:652)
    at FragmentInstanceContext.getSharedQueryDataSource(FragmentInstanceContext.java:786)
    at DataDriverContext.getSharedQueryDataSource(DataDriverContext.java:104)
    at DataDriver.initQueryDataSource(DataDriver.java:148)
    at DataDriver.initialize(DataDriver.java:102)
    at DataDriver.init(DataDriver.java:61)
    at Driver.lambda$processFor$1(Driver.java:147)
```
## Test plan
- [ ] Verify unit tests in `iotdb-core/datanode` still pass
- [ ] Reproduce the retry scenario in a cluster environment and confirm no
NPE occurs
- [ ] Confirm that normal query execution and EXPLAIN ANALYZE statistics
caching are unaffected
🤖 Generated with [Claude Code](https://claude.com/claude-code)