Hi Chris,

Thanks for the reminder.

My understanding is that this is not a security issue, but a query
correctness/runtime bug. My intention was to make the community aware of
it, especially because it was originally reported by community users
through another channel and we confirmed it afterwards. If users run into
the same problem before 1.3.8 is released, they can manually cherry-pick
the related fix commit as a temporary mitigation.

That said, I understand your concern about disclosure sensitivity,
especially when detailed reproduction steps are included. I will be more
careful with this kind of information in future discussions.

Regarding the wording around “open-source release”, thanks for pointing
that out as well. I will pay attention to this and avoid such wording in
future Apache discussions.

Best regards,
--------------------------
Yuan Tian

On Fri, Jul 3, 2026 at 4:36 PM Christofer Dutz <[email protected]> wrote:

> Hi Yuan,
>
> I think it’s not ideal to disclose security issues before the release
> candidate is officially released.
> If discussion is needed on this, it should happen on the private list
> (or a security list, if the project has one).
> After the RC is released, a separate announcement would be the ideal path.
> This way we are disclosing an attack vector while users have no way to
> mitigate it, possibly giving malicious entities an exact guide on how to
> exploit it.
>
> And your email contained the wording I was referring to in the other
> discussion … “open-source release” 😉
> What else should this be? I know that internally the commercial offering
> is referred to as “enterprise version”
> and the other is naturally the “open-source version” … Just mentioning it
> to increase sensitivity on this.
>
> Chris
>
>
> From: Yuan Tian <[email protected]>
> Date: Friday, July 3, 2026 at 01:49
> To: dev <[email protected]>
> Subject: [DISCUSS] Kicking off 1.3.8 release: critical query livelock fix
> in 1.3.6 / 1.3.7
>
> Hi all,
>
> I'd like to propose that we finalize the scope of the next open-source
> release 1.3.8 and start the release process as soon as possible. The main
> motivation is a critical query bug that affects both v1.3.6 and v1.3.7.
> The fix is already merged on dev/1.3, so 1.3.8 should at least include
> it.
>
> == Bug description and severity ==
>
> Under a specific but fairly common data pattern, a query on an aligned
> device enters a livelock: it never returns and never errors out, while
> the query driver thread spins at ~100% CPU, repeatedly burning its full
> time slice, until the query finally hits the timeout. I would rate this
> as critical:
>
> * The affected query always fails (by timeout), with no error message
>   that points to the cause, which makes it very hard for users to
>   diagnose.
> * Each stuck query pins query driver threads at full CPU for the entire
>   timeout window. A handful of such queries can saturate the query
>   thread pool and CPU, degrading all other queries on the node.
> * The trigger pattern is common in practice: an aligned device where the
>   queried measurement is sparse (contains nulls), combined with a time
>   range filter. Aggregations (e.g. count) and raw queries are both
>   affected, in both ASC and DESC order.
>
> We hit this in production on v1.3.6: EXPLAIN ANALYZE snapshots showed the
> scan operators' CPU time growing linearly (~60s per 15s wall time across
> driver threads) while output rows and all I/O statistics stayed
> completely frozen, and CPU flame graphs showed ~90% of samples inside
> SeriesScanUtil.initFirstChunkMetadata (with ~1/3 of that in the
> System.nanoTime() calls of the time-slice guard loop, i.e. a pure busy
> wait).
>
> == How to reproduce (verified on 1.3.6 / 1.3.7) ==
>
>   CREATE DATABASE root.sg1;
>   INSERT INTO root.sg1.d1(timestamp, s1, s2) ALIGNED VALUES (1, 1, 1);
>   INSERT INTO root.sg1.d1(timestamp, s1, s2) ALIGNED VALUES (2, null, 2);
>   INSERT INTO root.sg1.d1(timestamp, s1, s2) ALIGNED VALUES (3, null, 3);
>   FLUSH;
>   SELECT s1 FROM root.sg1.d1 WHERE time >= 3 AND time <= 4 ORDER BY time DESC;
>
> Expected: an empty result set. Actual: the query hangs until timeout.
> An ascending variant triggers the same livelock, e.g.
> "SELECT count(s1) FROM ... WHERE time <= X" when s1's non-null values all
> lie after X (this is the shape we hit in production).
>
> == Root cause ==
>
> Two statistics sources got out of sync:
>
> * File-level pruning (TimeFilter#canSkip) used the *time-column*
>   statistics of the aligned timeseries metadata.
> * SeriesScanUtil's overlap checks use ITimeSeriesMetadata#getStatistics(),
>   which for a single-measurement aligned scan returns the *value-column*
>   statistics (the non-null range, a subset of the time-column range).
>
> Since v1.3.6, the memtable scan optimization (commit dbc0133a on dev/1.3)
> additionally clamps the overlap-check endpoint by the global time filter
> range. As a result, a file whose time-column range overlaps the filter
> but whose queried measurement has no non-null value inside the filter
> range passes canSkip() and gets loaded, yet the clamped endpoint can
> never overlap the metadata's own statistics.
> initFirstChunkMetadata() then neither unpacks nor discards
> firstTimeSeriesMetadata, hasNextChunk() keeps returning Optional.empty(),
> and the operator's time-slice loop spins forever. v1.3.5 and earlier are
> not affected because the overlap endpoint was the metadata's own endTime,
> which always overlaps itself.
>
> == The fix ==
>
> Already on dev/1.3:
>
> * apache/tsfile#716: TimeFilter.canSkip()/allSatisfy() now use
>   getStatistics(), consistent with the scan-side overlap checks.
>   (develop-branch equivalent: apache/tsfile#715)
> * apache/iotdb#17120: bumps dev/1.3 to a tsfile version containing the
>   fix and adds a regression IT
>   (testQueryWithGlobalTimeFilterOrderByTimeDesc).
>
> Note that dev/1.3 currently depends on tsfile 1.1.4-SNAPSHOT, so an
> official tsfile 1.1.4 release is a prerequisite for releasing IoTDB
> 1.3.8.
>
> == Proposal ==
>
> 1. Release tsfile 1.1.4 (dev/1.1) first.
> 2. Cut rc/1.3.8 from dev/1.3 shortly after, which already contains the
>    fix above as well as several other correctness fixes in the same area
>    (e.g. #16993, #16970).
> 3. If you have other fixes or changes that should go into 1.3.8, please
>    reply in this thread so we can settle the scope quickly.
>
> Any feedback is welcome.
>
> Best regards,
> ----------------
> Yuan Tian
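
For anyone following the root-cause section of the quoted message, the mismatch between the two statistics sources can be illustrated with a small toy model. This is only a sketch under stated assumptions, not IoTDB or TsFile code: `Range`, `can_skip`, `unpacks_itself` and `outcome` are hypothetical names, and the clamping formula (the smaller of the end time and the filter's upper bound for ascending scans, the larger of the start time and the filter's lower bound for descending scans) is an assumption inferred from the description, not taken from the actual source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed time range [start, end]."""
    start: int
    end: int

    def overlaps(self, other: "Range") -> bool:
        return self.start <= other.end and other.start <= self.end


def can_skip(stats: Range, time_filter: Range) -> bool:
    """File-level pruning: skip when the statistics miss the filter entirely."""
    return not stats.overlaps(time_filter)


def unpacks_itself(value_stats: Range, time_filter: Range, ascending: bool) -> bool:
    """Scan-side overlap check against the filter-clamped endpoint (assumed formula)."""
    if ascending:
        endpoint = min(value_stats.end, time_filter.end)
        return value_stats.start <= endpoint
    endpoint = max(value_stats.start, time_filter.start)
    return value_stats.end >= endpoint


def outcome(time_stats: Range, value_stats: Range, time_filter: Range,
            ascending: bool, fixed: bool) -> str:
    """Classify what happens to one file's metadata for a single-measurement scan."""
    # Before the fix, pruning looked at time-column statistics;
    # after the fix, it uses the same value-column statistics as the scan.
    pruning_stats = value_stats if fixed else time_stats
    if can_skip(pruning_stats, time_filter):
        return "skipped"
    if unpacks_itself(value_stats, time_filter, ascending):
        return "unpacked"
    return "stuck"  # loaded, but neither unpacked nor discarded


# Repro data: timestamps 1..3, s1 non-null only at t=1, filter 3 <= time <= 4, DESC.
time_stats = Range(1, 3)
value_stats = Range(1, 1)
time_filter = Range(3, 4)

print(outcome(time_stats, value_stats, time_filter, ascending=False, fixed=False))
print(outcome(time_stats, value_stats, time_filter, ascending=False, fixed=True))
```

In this model the pre-fix path classifies the repro data as "stuck", which corresponds to the metadata that is loaded but never unpacked or discarded, while the post-fix path prunes the file up front. The ascending production shape behaves the same way if the value range lies entirely after the filter's upper bound.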
