the-mousaillon opened a new issue, #4259:
URL: https://github.com/apache/arrow-datafusion/issues/4259

   **Describe the bug**
   approx_quantile_cont panics, complaining that the input to TDigest is not 
ordered:
   ```panicked at 'unsorted input to TDigest'```
   
   I have done some digging to understand what happens and it seams to have 
something to do with a corruption of indexes whithin the arrow Array.
   
   I added a simple check to see if the array is ordered, and if not to print 
it in the 
[update_batch](https://github.com/apache/arrow-datafusion/blob/929a17525917727aa99b26667e9c3a8c55bd67c3/datafusion/physical-expr/src/aggregate/approx_percentile_cont.rs#L368-L374)
 function
   ```rust
       fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
           let values = &values[0];
           let v = downcast_value!(values, Int64Array);
           let sorted_values = &arrow::compute::sort(values, None)?;
           let a_down = downcast_value!(sorted_values, Int64Array);
   
           let mut is_sorted = true;
           for i in 0..sorted_values.len() -1 {
               if a_down.value(i) > a_down.value(i+1) {
                   is_sorted = false;
                   break;
               }
           }
           if !is_sorted {
              for i in 0..sorted_values.len() {
               println!("s: {}, b: {}", a_down.value(i), v.value(i));
           }
           }
           let sorted_values =
               
ApproxPercentileAccumulator::convert_to_ordered_float(sorted_values)?;
           self.digest = self.digest.merge_sorted_f64(&sorted_values);
           Ok(())
       }
   ```
   
   The output is surprising, we see that in seemingly every case, the first 
value of the sorted array *"s"* should be at the end of the array.
   For intance on this array :
   
![image](https://user-images.githubusercontent.com/36140579/202452429-3af4bf33-88c2-4580-a50f-4ee9b26dec28.png)
   
   The weird thing is that if I recreate the array, the sort works properly and 
the panic goes away.
   ```rust
       fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
           let values = &values[0];
           # If I just recreate the array before the sorting, the problem goes 
away
           let v = downcast_value!(values, Int64Array);
           let values: Int64Array = v.iter().map(|d| d.clone()).collect();
           let values = Arc::new(values) as ArrayRef;
           let values = &values;
           let sorted_values =
               
ApproxPercentileAccumulator::convert_to_ordered_float(sorted_values)?;
           self.digest = self.digest.merge_sorted_f64(&sorted_values);
           Ok(())
   }
   ```
   This makes me think that their may be some kind of index corruption within 
the array buffer.
   
   I had this bug while performing an approx_percentile_cont on a GROUP BY, 
whithout the GROUP BY it works fine.
   
   **To Reproduce**
   Steps to reproduce:
   1. Download the parquet file: 
[percentile_cont_bug.zip](https://github.com/apache/arrow-datafusion/files/10032416/percentile_cont_bug.zip)
   
   2. Execute this function
   
   ```rust
   fn percentile_cont_bug() ->Result<(), ()>{
       let ctx = SessionContext::new();
       let pq_path = "...path to the provided parquet file";
       ctx.register_parquet("example", pq_path, ParquetReadOptions::default())
           .await
           .map_err(|e|())?;
       
       let q = format!("
           SELECT 
               country,
               approx_median(age) as median_age,
           FROM example
           group by 
               country
       ");
       let df = ctx.sql(&q).await
       .map_err(|e| println!("err: {:?}", e))?;
       // execute and print results
       df.show().await;
       Ok(())
   }
   ```
   
   **Additional context**
   datafusion version: 14.0.0
   arrow version: 26.0
   platform: WSL (ubuntu)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to