[GitHub] [arrow] westonpace commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

via GitHub Thu, 16 Feb 2023 12:42:02 -0800


westonpace commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1433688267


   > Yes, it does mean we will. Do you foresee that as an issue? It sounds like 
the Java implementation takes the same approach.
   
   In datasets, for row group statistics, we [recently added a 
check](https://github.com/apache/arrow/pull/15125) that was roughly...
   
   ```
   if (is_nan(min) && is_nan(max)) {
     // Ignore statistics
   } else if (is_nan(min)) {
     // Assume x <= max
   } else if (is_nan(max)) {
     // Assume x >= min
   } else {
     // Assume min <= x <= max
   }
   ```
   
   In other words, if only one of min or max is NaN then we still use the other 
side of the inequality.  My primary concern is to validate that this is a safe 
assumption: I want to make sure we aren't using garbage data in our handling 
of row groups.
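   
   A minimal, self-contained sketch of that rule, for illustration only. The 
names `Bounds`, `BoundsFromStats` and `MayContain` are hypothetical and are not 
taken from the Arrow or Parquet code; the actual check in the linked PR may be 
structured differently.
   
   ```cpp
   #include <cmath>
   #include <optional>

   // Hypothetical type for illustration; not an Arrow or Parquet API.
   struct Bounds {
     std::optional<double> lower;  // x >= *lower when set
     std::optional<double> upper;  // x <= *upper when set
   };

   // Keep only the non-NaN side(s) of the row group statistics.
   // If both min and max are NaN, the result carries no bounds at all.
   inline Bounds BoundsFromStats(double min, double max) {
     Bounds b;
     if (!std::isnan(min)) b.lower = min;
     if (!std::isnan(max)) b.upper = max;
     return b;
   }

   // Returns false only when the bounds prove the row group cannot hold x.
   inline bool MayContain(const Bounds& b, double x) {
     if (b.lower && x < *b.lower) return false;
     if (b.upper && x > *b.upper) return false;
     return true;
   }
   ```
   
   With a NaN min and a max of 10, this sketch would still skip a row group for 
`x == 11`, which is only correct if the non-NaN side can be trusted.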
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
