It would be nice to avoid the problem by changing the semantics of
Iceberg’s notNull, but I don’t think that’s a good idea for 2 main reasons.

First, I think that API users creating expressions directly expect the
current behavior. It would be surprising to a user if a notEqual expression
didn’t return nulls. People integrating Iceberg in SQL engines will be more
aware of SQL semantics, especially if the behavior we choose is documented.
I think for API uses, the current behavior is better.

Second, some evaluations require expressions to be rewritten without not,
so we have a utility that pushes not down an expression tree to the leaf
predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
"x"). If we were to change the semantics of notEqual, then this rewrite
would no longer be valid. If col is null then it is not equal to x, and
negating that result is true. But notEqual would give a different answer so
we can’t rewrite it.

We can work around the rewrite problem by adding Expressions.sqlNotEqual
method for engines to call that has the SQL semantics by returning
and(notEqual("col",
"x"), notNull("col")).

For pushdown, we should add tests for these cases and rewrite expressions
to account for the difference. Iceberg should push notEqual("col", "x") to
ORC as SQL (col != 'x' or col is null). Presto can similarly translate col
!= 'x' to and(notEqual("col", "x"), notNull("col").

rb

On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> I think that we should follow the SQL semantics to prevent surprises when
> SQL engines integrate with Iceberg.
>
> .. Owen
>
> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <shardulsmaha...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I noticed that Iceberg's predicates are not compatible with SQL
>> predicates when it comes to handling NULL values. In SQL, if any of the
>> operands of a scalar comparison predicate is NULL, then the resultant truth
>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>> resultant output will be different for Iceberg v/s SQL. e.g.
>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>> more rows than required when an engine pushes down these predicates,
>> however the engines will filter out these additional rows, so everything
>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>> on Iceberg's expression evaluators for filtering will return the additional
>> rows. Should we change the behavior of Iceberg expressions to be more
>> SQL-like or should we keep this behavior and document the differences when
>> compared with SQL?
>>
>> This also has some implications on predicate pushdown e.g. ORC follows
>> SQL semantics and if we try to push down Iceberg predicates, simply
>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>> as it does not return NULLs contrary to what Iceberg expects.
>>
>> Thanks,
>> Shardul
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to