Rich-T-kid commented on PR #21589:
URL: https://github.com/apache/datafusion/pull/21589#issuecomment-4263688598

   ### Null handling
   **How nulls are currently handled**
   The first time a null is encountered, a sentinel byte sequence is written 
to both `seen_elements` and the hash map. During `emit()`, 
`transform_into_array` compares the raw bytes of each entry against this 
sentinel, so a null entry is appended to the output array as an actual null 
rather than being written out as a value. Without the sentinel, the raw bytes 
alone could not distinguish a null from a legitimate value that happens to 
have the same bytes.
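
   To make the mechanism concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `DistinctStrings`, `NULL_SENTINEL`, and the choice of `0xFF` as the sentinel byte are assumptions for illustration, and `emit` returns a plain `Vec` instead of building an Arrow array the way `transform_into_array` would.

```rust
use std::collections::HashSet;

// Assumed sentinel for this sketch: the byte 0xFF never occurs in valid UTF-8,
// so no real string can collide with it.
const NULL_SENTINEL: &[u8] = &[0xFF];

/// Hypothetical stand-in for the accumulator state described above.
struct DistinctStrings {
    /// Hash set used for deduplication.
    seen: HashSet<Vec<u8>>,
    /// Distinct entries in first-seen order, stored as raw bytes.
    seen_elements: Vec<Vec<u8>>,
}

impl DistinctStrings {
    fn new() -> Self {
        Self {
            seen: HashSet::new(),
            seen_elements: Vec::new(),
        }
    }

    /// Record a value; a null is stored as the sentinel bytes.
    fn insert(&mut self, value: Option<&str>) {
        let bytes = match value {
            Some(s) => s.as_bytes(),
            None => NULL_SENTINEL,
        };
        // `HashSet::insert` returns true only the first time the bytes are seen.
        if self.seen.insert(bytes.to_vec()) {
            self.seen_elements.push(bytes.to_vec());
        }
    }

    /// Turn the raw entries back into values, mapping the sentinel to a null.
    fn emit(&self) -> Vec<Option<String>> {
        self.seen_elements
            .iter()
            .map(|bytes| {
                if bytes.as_slice() == NULL_SENTINEL {
                    None
                } else {
                    // Every non-sentinel entry came from a `&str`, so it is valid UTF-8.
                    Some(String::from_utf8(bytes.clone()).expect("stored bytes are UTF-8"))
                }
            })
            .collect()
    }
}
```

   Note that the empty string and the null stay distinct here, because the empty string is stored as zero bytes while the null is stored as the sentinel.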
   ### Why this is hard to extend to other types
   This approach works for Utf8 because some byte sequences are invalid 
UTF-8, so one of them can serve as an unambiguous sentinel that can never be 
confused with real data. It is difficult to extend to other types, however. 
There is no equivalent concept of an 'invalid' binary buffer, since any 
sequence of raw bytes is valid by definition. The same problem applies to 
numeric types, where every bit pattern is a legitimate value.
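
   The asymmetry can be checked with the standard library alone. In the sketch below, `is_valid_utf8` and `i32_from_bytes` are hypothetical helper names introduced only for illustration. The byte `0xFF` is rejected as UTF-8, while four `0xFF` bytes decode to a perfectly ordinary `i32`.

```rust
/// Returns true if `bytes` is a well-formed UTF-8 sequence.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Every 4-byte pattern decodes to some i32, so no pattern is left over
/// to act as a sentinel.
fn i32_from_bytes(bytes: [u8; 4]) -> i32 {
    i32::from_le_bytes(bytes)
}
```

   For example, `is_valid_utf8(&[0xFF])` is false, whereas `i32_from_bytes([0xFF; 4])` is simply `-1`.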
   One idea was to use an n+1 sized buffer to signal a null state: for 
example, representing a null i32 as 5 bytes instead of the expected 4, where 
the extra byte acts as a flag to distinguish it from a valid value. However, 
this feels clunky, adds overhead to every value comparison, and still doesn't 
resolve the core issue for binary types, where no such invalid state can be 
expressed through the raw byte representation alone. This is worth its own 
investigation and is another reason to defer non-string types to a follow-up 
issue.
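
   For reference, one possible reading of the n+1 idea for i32 looks like the sketch below. The names `encode_i32` and `decode_i32` are hypothetical, and this is only an illustration of the trade-off, not a proposal for the PR. A value is stored as its 4 little-endian bytes and a null as a 5-byte buffer, so the length alone identifies the null.

```rust
use std::convert::TryFrom;

/// Encode an optional i32: 4 bytes for a value, 5 bytes for a null.
fn encode_i32(value: Option<i32>) -> Vec<u8> {
    match value {
        Some(v) => v.to_le_bytes().to_vec(),
        // The null is the only entry that is one byte longer than a value.
        None => vec![0u8; 5],
    }
}

/// Decode an entry produced by `encode_i32`.
/// Anything that is not exactly 4 bytes long is treated as a null.
fn decode_i32(bytes: &[u8]) -> Option<i32> {
    match <[u8; 4]>::try_from(bytes) {
        Ok(arr) => Some(i32::from_le_bytes(arr)),
        Err(_) => None,
    }
}
```

   This relies on every real value having a fixed width, which is exactly what a binary column lacks: a 5-byte buffer there is just another valid value.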


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

