waynexia commented on issue #4840: URL: https://github.com/apache/arrow-rs/issues/4840#issuecomment-1906616562
I'd like to offer some thoughts from a user's perspective.

When processing data where null is very common, it's natural to look for a way to reduce the cost of those null values. We know that part of the data is missing, so of course we can optimize based on that. But at present this seems difficult. I can't use `NullArray`, not only because of the type problem and the "logical or physical" problem, but also because of the parquet side, which requires the array to have the same type as the schema. And in this scenario a `Null` type never occurs, since some other parts will have data. Here some data are "logical null", but I can't say whether they are "physical null" (or should I even consider it?).

(BTW, if I want to write this part of the data to parquet, or pass/compute it under a given schema, I can only build a corresponding array and fill in `None` one by one. This is costly compared to how a `NullArray` works.)

Based on whether the type is null and whether the value is null, we get four (!!) types of null. When the type is null, a test function like `is_null()` gives `true` when the value is present and is null (a), and gives `false` when the value is missing (b). When the type is anything else, a null value of course is `is_null()` (c) and a non-null value is not `is_null()` (d). Please correct me if this is wrong.

Listing them out, some questions come to mind:

- Is it really necessary to distinguish cases (a) and (b)? I have to introduce a new word, "present", to describe the difference.
- Comparing cases (a) and (c), does this mean we have a fifth type of null, where the type is not null but the value "is not present"?
- A null value should be a wildcard value, as it can fit into other types (case c). This is done by letting `None` be a valid value for an array.
- We should have two kinds of null array: one for (a) and (b), where the type is null, and another for (c), where the array is a compound array.

Physical and logical null are truly confusing.
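To make the distinction concrete, here is a minimal toy model in plain Rust of one possible reading of the physical/logical split. It is a sketch only: `ToyArray`, `is_physical_null`, `is_logical_null`, and `stored_slots` are hypothetical names invented for illustration and are not arrow-rs APIs. The sketch assumes that a null-typed array carries only a length and no per-row validity information, so a "physical" check has nothing to consult and reports `false`, while a "logical" check treats every row as null.

```rust
// Toy model only: `ToyArray` and its methods are hypothetical, not arrow-rs types.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyArray {
    /// Null-typed array: stores a length and nothing else.
    Null { len: usize },
    /// Typed array: one slot per row, where `None` marks a null row.
    Int32 { values: Vec<Option<i32>> },
}

impl ToyArray {
    /// Build a typed array in which every row is null (one `None` per row).
    pub fn all_null_int32(len: usize) -> Self {
        ToyArray::Int32 { values: vec![None; len] }
    }

    /// Number of rows in the array.
    pub fn len(&self) -> usize {
        match self {
            ToyArray::Null { len } => *len,
            ToyArray::Int32 { values } => values.len(),
        }
    }

    /// "Physical" null: answers from per-row validity information only.
    /// ASSUMPTION: the Null variant has no such information, so it reports false.
    pub fn is_physical_null(&self, i: usize) -> bool {
        assert!(i < self.len(), "index out of bounds");
        match self {
            ToyArray::Null { .. } => false,
            ToyArray::Int32 { values } => values[i].is_none(),
        }
    }

    /// "Logical" null: every row of a null-typed array counts as null.
    pub fn is_logical_null(&self, i: usize) -> bool {
        assert!(i < self.len(), "index out of bounds");
        match self {
            ToyArray::Null { .. } => true,
            ToyArray::Int32 { values } => values[i].is_none(),
        }
    }

    /// Number of per-row slots this toy representation has to store.
    pub fn stored_slots(&self) -> usize {
        match self {
            ToyArray::Null { .. } => 0,
            ToyArray::Int32 { values } => values.len(),
        }
    }
}
```

In this model, `ToyArray::Null { len: 3 }` is logically null at every index but never physically null, whereas `ToyArray::all_null_int32(3)` is both, at the price of storing three slots instead of none. That storage difference is the cost mentioned above when an all-null column has to match a non-null schema type.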
But is there any way to make null handling intuitive and easy to use? :thinking:

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
