alamb opened a new issue, #9230:
URL: https://github.com/apache/arrow-datafusion/issues/9230

   ### Describe the bug
   
   The logic introduced in https://github.com/apache/arrow-datafusion/pull/9208
   is (very subtly) incorrect, as I found while upgrading InfluxDB to use it.
   
   ### To Reproduce
   
   The data is like this
   ```
   name: cpu
   +---------------------+--------------+
   | time                | usage_system |
   +---------------------+--------------+
   | 1970-01-01T00:01:00 | 99.8         |
   | 1970-01-01T00:01:00 | 89.5         |
   | 1970-01-01T00:01:10 | 88.6         |
   | 1970-01-01T00:01:10 | 99.7         |
   | 1970-01-01T00:01:20 |              |
   | 1970-01-01T00:01:20 |              |
   | 1970-01-01T00:01:30 | 83.4         |
   | 1970-01-01T00:01:30 |              |
   | 1970-01-01T00:01:40 | 87.7         |
   | 1970-01-01T00:01:40 |              |
   | 1970-01-01T00:01:50 |              |
   | 1970-01-01T00:01:50 |              |
   | 1970-01-01T00:02:00 |              |
   | 1970-01-01T00:02:00 | 99.9         |
   | 1970-01-01T00:02:10 | 89.8         |
   | 1970-01-01T00:02:10 | 99.8         |
   | 1970-01-01T00:02:20 |              |
   | 1970-01-01T00:02:20 | 99.9         |
   | 1970-01-01T00:02:30 |              |
   | 1970-01-01T00:02:30 |              |
   | 1970-01-01T00:02:40 |              |
   | 1970-01-01T00:02:40 |              |
   | 1970-01-01T00:02:50 | 89.8         |
   | 1970-01-01T00:02:50 |              |
   | 1970-01-01T00:03:00 | 90.0         |
   | 1970-01-01T00:03:00 | 99.8         |
   | 1970-01-01T00:03:10 |              |
   | 1970-01-01T00:03:10 | 99.8         |
   +---------------------+--------------+
   ```
   
   Note that `usage_system` has BOTH 14 null values and 14 non-null values.
   
   The original predicate was
   ```
   usage_system@3 IS NOT NULL AND time@1 <= 1707945813397450000
   ```
   
   And the rewritten predicate is (now)
   ```
   usage_system_null_count@0 = 0 AND time_min@1 <= 1707945813397450000
   ```
   
   
   Thus, the input data statistics look like this
   ```
   +-------------------------+---------------------+
   | usage_system_null_count | time_min            |
   +-------------------------+---------------------+
   | 14                      | 1970-01-01T00:01:00 |
   +-------------------------+---------------------+
   ```
   
   With these statistics, this predicate evaluates to false
   (`usage_system_null_count` is 14, not 0) and the data is pruned.
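
To make the failure concrete, here is a minimal sketch (plain Python, not DataFusion code) that recomputes the statistics from the table above and evaluates the rewritten predicate by hand. It assumes `time` is stored as nanoseconds since the epoch, and all variable names are made up for illustration.

```python
# usage_system values copied from the table above; None stands for a null.
usage_system = [
    99.8, 89.5, 88.6, 99.7, None, None, 83.4, None, 87.7, None,
    None, None, None, 99.9, 89.8, 99.8, None, 99.9, None, None,
    None, None, 89.8, None, 90.0, 99.8, None, 99.8,
]

NANOS_PER_SECOND = 1_000_000_000
# 1970-01-01T00:01:00, assuming timestamps are nanoseconds since the epoch.
time_min = 60 * NANOS_PER_SECOND

# Container-level statistics that the pruning predicate gets to see.
null_count = sum(v is None for v in usage_system)
non_null_count = len(usage_system) - null_count

# Hand evaluation of the rewritten predicate:
#   usage_system_null_count = 0 AND time_min <= 1707945813397450000
rewritten_keeps_container = null_count == 0 and time_min <= 1707945813397450000

print(null_count, non_null_count)   # 14 14
print(rewritten_keeps_container)    # False
```

The container is dropped even though 14 of its rows have a non-null `usage_system`.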
   
   
   ### Expected behavior
   
   The data should not be pruned because there are rows for which `IS NOT NULL`
   evaluates to true (there are non-null values in the data).
   
   
   
   ### Additional context
   
   The pruning predicate needs to return `false` only if "there are no rows 
that could possibly match the predicate", which for `IS NOT NULL` means that 
there are *only* null values in the data, but the current implementation checks 
if there are *any* null values in the data.
   
   So in this case, the original predicate needs to be rewritten to
   ```
   usage_system_null_count@0 != usage_system_row_count@0 AND time_min@1 <= 1707945813397450000
   ```
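
The intended semantics for `IS NOT NULL` can be sketched as follows: a container is pruned only when every row is null, that is, when the null count equals the row count, and it is kept in every other case. This is an illustrative Python sketch, not the DataFusion implementation. The function name is hypothetical, and treating missing statistics as "keep the container" is an assumption chosen to be conservative.

```python
from typing import Optional


def may_contain_non_null(null_count: Optional[int], row_count: Optional[int]) -> bool:
    """Return True if the container might hold a row where `col IS NOT NULL`.

    Hypothetical helper for illustration only. Unknown statistics (None)
    are treated conservatively: the container is kept.
    """
    if null_count is None or row_count is None:
        return True
    # Only a container made up entirely of nulls can be skipped safely.
    return null_count != row_count


# Statistics from this issue: 14 nulls out of 28 rows, so the container is kept.
print(may_contain_non_null(14, 28))    # True
# Every row is null, so no row can match and the container can be pruned.
print(may_contain_non_null(28, 28))    # False
# Row count not available, so the container is kept to be safe.
print(may_contain_non_null(14, None))  # True
```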
   
   We of course don't have `row_count` information yet (but @appletreeisyellow
   is working on adding it in https://github.com/apache/arrow-datafusion/issues/9171).
   
   This kind of subtle logic bug is another reason I think we should seriously
   consider the range-based analysis described in
   https://github.com/apache/arrow-datafusion/issues/7887
   

