MaheshGPai opened a new issue, #473:
URL: https://github.com/apache/datasketches-cpp/issues/473

   ## Bug: tdigest const_iterator returns dangling reference causing incorrect 
values
   
   ### Summary
   
   The `tdigest<T>::const_iterator` has as issue where it returns 
`std::pair<const T&, const W>` with a reference to a temporary value, leading 
to undefined behavior and incorrect results when iterating over centroids.
   
   ### Root Cause
   
   In `tdigest.hpp` line 319, the iterator's `value_type` is defined as:
   ```cpp
   using value_type = std::pair<const T&, const W>;
   ```
   
   In `tdigest_impl.hpp` line 681, `operator*()` returns:
   ```cpp
   return value_type(centroids_[index_].get_mean(), 
centroids_[index_].get_weight());
   ```
   
   The problem is that `centroid::get_mean()` returns **by value** (line 83 in 
tdigest.hpp):
   ```cpp
   T get_mean() const { return mean_; }  // Returns T, not T&
   ```
   
   When constructing the pair, `const T&` binds to the **temporary** `T` 
returned by `get_mean()`. This temporary is destroyed immediately after the 
pair is constructed, leaving the reference dangling.
   
   ### Impact
   
   - Reading `centroid.first` or `it->first` returns garbage values or always 
returns the same incorrect value
   - The issue manifests differently depending on compiler optimizations and 
build environment
   - In some environments (like ClickHouse), all iterations return the same 
incorrect value (e.g., `0.0` for all centroids)
   - In other environments, you get random garbage values like `3.00897e-314`
   
   ### Reproduction
   
   Compile and run this test program with optimizations enabled (`-O2` or 
`-O3`):
   
   ```cpp
   #include <iostream>
   #include <memory>
   #include "tdigest.hpp"
   
   int main() {
       auto td = std::make_unique<datasketches::tdigest<double>>(100);
       
       // Insert distinct values
       for (int i = 0; i < 10; i++) {
           td->update(i);
       }
       
       std::cout << "Iterating with explicit begin/end:" << std::endl;
       auto it = td->begin();
       auto end_it = td->end();
       
       int count = 0;
       while (it != end_it) {
           double mean = it->first;  // Dangling reference!
           uint64_t weight = it->second;
           std::cout << count++ << ": mean=" << mean << ", weight=" << weight 
<< std::endl;
           ++it;
       }
       
       return 0;
   }
   ```
   
   **Expected output:** Different mean values (0, 1, 2, 3, ..., 9)
   
   **Actual output:** Same value repeated or garbage values:
   ```
   0: mean=0.000, weight=1
   1: mean=0.000, weight=1
   2: mean=0.000, weight=1
   ...
   ```
   or
   ```
   0: mean=3.00897e-314, weight=1
   1: mean=3.00897e-314, weight=1
   ...
   ```
   
   ### Why range-based for loops sometimes work
   
   Range-based for loops with `const auto&&` may appear to work in simple test 
programs due to compiler-specific temporary lifetime extension, but this is not 
guaranteed and fails in complex build environments.
   
   ### Proposed Fix
   
   Change the `value_type` to return values instead of references:
   
   **In `tdigest.hpp` line 319:**
   ```cpp
   // Before:
   using value_type = std::pair<const T&, const W>;
   
   // After:
   using value_type = std::pair<T, W>;
   ```
   
   This ensures that `operator*()` returns a pair containing **copied values** 
rather than a reference to a temporary, eliminating the undefined behavior.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to