MaheshGPai opened a new issue, #473:
URL: https://github.com/apache/datasketches-cpp/issues/473
## Bug: tdigest const_iterator returns dangling reference causing incorrect
values
### Summary
The `tdigest<T>::const_iterator` has as issue where it returns
`std::pair<const T&, const W>` with a reference to a temporary value, leading
to undefined behavior and incorrect results when iterating over centroids.
### Root Cause
In `tdigest.hpp` line 319, the iterator's `value_type` is defined as:
```cpp
using value_type = std::pair<const T&, const W>;
```
In `tdigest_impl.hpp` line 681, `operator*()` returns:
```cpp
return value_type(centroids_[index_].get_mean(),
centroids_[index_].get_weight());
```
The problem is that `centroid::get_mean()` returns **by value** (line 83 in
tdigest.hpp):
```cpp
T get_mean() const { return mean_; } // Returns T, not T&
```
When constructing the pair, `const T&` binds to the **temporary** `T`
returned by `get_mean()`. This temporary is destroyed immediately after the
pair is constructed, leaving the reference dangling.
### Impact
- Reading `centroid.first` or `it->first` returns garbage values or always
returns the same incorrect value
- The issue manifests differently depending on compiler optimizations and
build environment
- In some environments (like ClickHouse), all iterations return the same
incorrect value (e.g., `0.0` for all centroids)
- In other environments, you get random garbage values like `3.00897e-314`
### Reproduction
Compile and run this test program with optimizations enabled (`-O2` or
`-O3`):
```cpp
#include <iostream>
#include <memory>
#include "tdigest.hpp"
int main() {
auto td = std::make_unique<datasketches::tdigest<double>>(100);
// Insert distinct values
for (int i = 0; i < 10; i++) {
td->update(i);
}
std::cout << "Iterating with explicit begin/end:" << std::endl;
auto it = td->begin();
auto end_it = td->end();
int count = 0;
while (it != end_it) {
double mean = it->first; // Dangling reference!
uint64_t weight = it->second;
std::cout << count++ << ": mean=" << mean << ", weight=" << weight
<< std::endl;
++it;
}
return 0;
}
```
**Expected output:** Different mean values (0, 1, 2, 3, ..., 9)
**Actual output:** Same value repeated or garbage values:
```
0: mean=0.000, weight=1
1: mean=0.000, weight=1
2: mean=0.000, weight=1
...
```
or
```
0: mean=3.00897e-314, weight=1
1: mean=3.00897e-314, weight=1
...
```
### Why range-based for loops sometimes work
Range-based for loops with `const auto&&` may appear to work in simple test
programs due to compiler-specific temporary lifetime extension, but this is not
guaranteed and fails in complex build environments.
### Proposed Fix
Change the `value_type` to return values instead of references:
**In `tdigest.hpp` line 319:**
```cpp
// Before:
using value_type = std::pair<const T&, const W>;
// After:
using value_type = std::pair<T, W>;
```
This ensures that `operator*()` returns a pair containing **copied values**
rather than a reference to a temporary, eliminating the undefined behavior.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]