xiedeyantu commented on PR #21054:
URL: https://github.com/apache/datafusion/pull/21054#issuecomment-4114934688
@alamb Thank you for your feedback! I have reviewed the DuckDB behavior and
the discussions in the linked GitHub issues. This is indeed a complex issue
concerning NULL handling semantics in composite type (e.g., struct/row)
comparisons.
The core of the problem, as detailed in the provided links, is the
inconsistency in how different databases handle comparisons when struct/row
fields contain NULL values. Specifically:
* The initial issue (#11292) pointed out that DuckDB's comparison semantics
for structs with NULL fields were non-standard and did not match
PostgreSQL/SQLite. This was fixed in Pull Request #11496, which aimed to align
DuckDB's semantics more closely with PostgreSQL's behavior for nested
comparisons.
* However, a later issue (#18039) highlighted that the "fix" introduced a
Postgres incompatibility in certain cases. The discussion reveals that
PostgreSQL itself has inconsistent semantics: the result of a row comparison
depends on the context (e.g., comparing two row constructors directly vs.
within a join). PostgreSQL's official documentation acknowledges this
inconsistency, stating that for consistent sorting and indexing, it treats
NULLs within composite types differently than in other contexts.
* A related issue (#18339) demonstrates a concrete consequence: a
"NOT IN" query with multiple columns (which is semantically a
multi-column/row comparison) can produce incorrect results due to this NULL
comparison logic.
The DuckDB maintainer (@hawkfish) concluded in the discussion for issue
#18039 that because PostgreSQL's behavior is itself inconsistent and
context-dependent ("horribly inconsistent"), DuckDB might stick with its
current, deliberate semantics (which treat NULLs in nested comparisons as
comparable, akin to
"NULLS LAST" semantics) rather than try to match PostgreSQL's inconsistent
model.
Regarding your question about documentation:
Based on the links, DuckDB did update its documentation to clarify its
chosen semantics. The PR #11496 was labeled "Needs Documentation", and a
related documentation update is referenced. Therefore, if your system has made
a deliberate choice about its NULL semantics for composite type comparisons
(whether to follow one standard, be inconsistent like PostgreSQL, or define its
own), it is highly advisable to document this behavior clearly. The
documentation should specify how comparisons are evaluated when NULLs are
present in struct/row fields, and note any known deviations from other database
systems like PostgreSQL.
In summary, while there is no single universally "correct" solution due to
PostgreSQL's own inconsistencies, clearly documenting your system's chosen
semantics is the recommended path forward.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]