alamb opened a new issue, #9230: URL: https://github.com/apache/arrow-datafusion/issues/9230
### Describe the bug

The logic introduced in https://github.com/apache/arrow-datafusion/pull/9208 is (very subtly) incorrect, as I found while upgrading to use it in InfluxDB.

### To Reproduce

The data is like this:

```
name: cpu
+---------------------+--------------+
| time                | usage_system |
+---------------------+--------------+
| 1970-01-01T00:01:00 | 99.8         |
| 1970-01-01T00:01:00 | 89.5         |
| 1970-01-01T00:01:10 | 88.6         |
| 1970-01-01T00:01:10 | 99.7         |
| 1970-01-01T00:01:20 |              |
| 1970-01-01T00:01:20 |              |
| 1970-01-01T00:01:30 | 83.4         |
| 1970-01-01T00:01:30 |              |
| 1970-01-01T00:01:40 | 87.7         |
| 1970-01-01T00:01:40 |              |
| 1970-01-01T00:01:50 |              |
| 1970-01-01T00:01:50 |              |
| 1970-01-01T00:02:00 |              |
| 1970-01-01T00:02:00 | 99.9         |
| 1970-01-01T00:02:10 | 89.8         |
| 1970-01-01T00:02:10 | 99.8         |
| 1970-01-01T00:02:20 |              |
| 1970-01-01T00:02:20 | 99.9         |
| 1970-01-01T00:02:30 |              |
| 1970-01-01T00:02:30 |              |
| 1970-01-01T00:02:40 |              |
| 1970-01-01T00:02:40 |              |
| 1970-01-01T00:02:50 | 89.8         |
| 1970-01-01T00:02:50 |              |
| 1970-01-01T00:03:00 | 90.0         |
| 1970-01-01T00:03:00 | 99.8         |
| 1970-01-01T00:03:10 |              |
| 1970-01-01T00:03:10 | 99.8         |
+---------------------+--------------+
```

Note that `usage_system` has BOTH 14 null values and 14 non-null values.

The original predicate was

```
usage_system@3 IS NOT NULL AND time@1 <= 1707945813397450000
```

And the rewritten predicate is (now)

```
usage_system_null_count@0 = 0 AND time_min@1 <= 1707945813397450000
```

The input data statistics look like this:

```
+-------------------------+---------------------+
| usage_system_null_count | time_min            |
+-------------------------+---------------------+
| 14                      | 1970-01-01T00:01:00 |
+-------------------------+---------------------+
```

With these statistics the rewritten predicate evaluates to false (`14 = 0` is false) and the data is pruned.

### Expected behavior

The data should not be pruned because there are rows for which `IS NOT NULL` evaluates to true (there are non-null values in the data).

### Additional context

The pruning predicate needs to return `false` only if "there are no rows that could possibly match the predicate", which for `IS NOT NULL` means that there are *only* null values in the data. The current implementation instead checks whether there are *any* null values in the data.

So in this case, the original predicate needs to be rewritten so that it is false only when the null count equals the row count:

```
usage_system_null_count@0 != usage_system_row_count@0 AND time_min@1 <= 1707945813397450000
```

We of course don't have `row_count` information yet (but @appletreeisyellow is working on adding it in https://github.com/apache/arrow-datafusion/issues/9171).

This kind of subtle logic bug is another reason I think we should seriously consider the range-based analysis described in https://github.com/apache/arrow-datafusion/issues/7887
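
To make the difference between the two checks concrete, here is a minimal sketch in plain Python. It is not DataFusion code: `ContainerStats`, `keep_buggy`, and `keep_fixed` are hypothetical names used only for illustration, and the row count of 28 is taken from the table in this issue. Each function returns `True` when the container must be kept (some row might satisfy `IS NOT NULL`) and `False` when it can be pruned.

```python
from dataclasses import dataclass


@dataclass
class ContainerStats:
    """Hypothetical per-container statistics for a single column."""
    null_count: int
    row_count: int


def keep_buggy(stats: ContainerStats) -> bool:
    # Check described in this issue as incorrect:
    # keeps the container only when it holds no nulls at all.
    return stats.null_count == 0


def keep_fixed(stats: ContainerStats) -> bool:
    # Intended rule: prune only when every row is null,
    # i.e. keep whenever at least one non-null value can exist.
    return stats.null_count != stats.row_count


# Statistics from the reproduction: 14 nulls out of 28 rows.
cpu = ContainerStats(null_count=14, row_count=28)
# A container where pruning is actually safe: every value is null.
all_null = ContainerStats(null_count=28, row_count=28)

if __name__ == "__main__":
    print("buggy keeps cpu:     ", keep_buggy(cpu))
    print("fixed keeps cpu:     ", keep_fixed(cpu))
    print("fixed keeps all_null:", keep_fixed(all_null))
```

With the `cpu` statistics the first check discards a container that still has 14 matching rows, while the second keeps it and only discards the all-null container.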
