Rich-T-kid commented on PR #21589: URL: https://github.com/apache/datafusion/pull/21589#issuecomment-4263688598
### Null handling

**How nulls are currently handled**

The first time a null is encountered, a sentinel byte representation is written to `seen_elements` as well as the hash map. This sentinel exists so that during `emit()`, when `transform_into_array` is called, we can compare the raw bytes of each entry against the sentinel to distinguish a null from a real value. That ensures a null is appended to the output array as an actual null rather than being written out as a value. Without this, there would be no way to tell from the raw byte representation alone whether an entry represents a null or a legitimate value that happens to have the same bytes.

### Why this is hard to extend to other types

This approach works for Utf8 because there exist byte sequences that are invalid UTF-8, which can serve as an unambiguous sentinel that can never be confused with real data. However, it is difficult to extend to other types. For example, there is no equivalent concept of an "invalid" binary buffer, since any sequence of raw bytes is valid by definition. The same problem applies to numeric types: every bit pattern of an i32 is a valid i32.

One idea was to use an n+1 sized buffer to signal a null state, for example representing a null i32 as 5 bytes instead of the expected 4, where the extra byte acts as a flag to distinguish it from a valid value. However, this feels clunky, adds overhead to every value comparison, and still doesn't resolve the core issue for binary types, where values are variable-length and no invalid state can be expressed through the raw byte representation alone.

This is worth its own investigation and is another reason to defer non-string types to a follow-up issue.
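To make the sentinel idea above concrete, here is a minimal std-only sketch. It is not the code from this PR: `DistinctStrings`, `push`, `NULL_SENTINEL`, and the choice of the single byte `0xFF` are all hypothetical names and values picked for illustration, and the `emit` here returns a plain `Vec` instead of building an Arrow array. Only the field name `seen_elements` is borrowed from the description above.

```rust
use std::collections::HashSet;

/// Hypothetical sentinel (an assumption, not the PR's actual bytes):
/// 0xFF never appears in valid UTF-8, so no real string can collide with it.
const NULL_SENTINEL: &[u8] = &[0xFF];

/// Hypothetical stand-in for the accumulator state: distinct values
/// stored as raw bytes, in first-seen order.
#[derive(Default)]
struct DistinctStrings {
    seen: HashSet<Vec<u8>>,
    seen_elements: Vec<Vec<u8>>,
}

impl DistinctStrings {
    /// Record a value; a null is stored as the sentinel bytes.
    fn push(&mut self, value: Option<&str>) {
        let bytes = match value {
            Some(s) => s.as_bytes(),
            None => NULL_SENTINEL,
        };
        // `insert` returns true only the first time these bytes are seen.
        if self.seen.insert(bytes.to_vec()) {
            self.seen_elements.push(bytes.to_vec());
        }
    }

    /// Turn the raw entries back into values, mapping the sentinel to a null.
    fn emit(&self) -> Vec<Option<String>> {
        self.seen_elements
            .iter()
            .map(|bytes| {
                if bytes.as_slice() == NULL_SENTINEL {
                    None
                } else {
                    // Safe to expect: every non-sentinel entry came from a &str.
                    Some(String::from_utf8(bytes.clone()).expect("valid UTF-8"))
                }
            })
            .collect()
    }
}
```

Pushing `Some("a")`, `None`, `Some("a")`, `None`, `Some("")` and then calling `emit()` yields `"a"`, a null, and the empty string, so the null stays distinct even from an empty value. The same comparison would be wrong for a binary column: `[0xFF]` is a perfectly legitimate binary value there, and it would be emitted as a null.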
