jorgecarleitao commented on pull request #9454:
URL: https://github.com/apache/arrow/pull/9454#issuecomment-778036785
@abreis , I am not convinced that that is sufficient, unfortunately, because
it excludes all types that are not Numeric (i.e. all dates and times for
primitives, as well as all other logical types).
We could of course offer a `DatumTemporal` and a `DatumList` and so one and
so forth, but IMO that is introducing more complexity, as we now start having
`struct`s for all kinds of logical types.
Generally, the logical operation performed on the data depends on its
`DataType`. Thus, the scalar needs to contain that logical data type
information.
In my opinion, the data structure should be something like what DataFusion
has, but instead of having one enum variant per logical type, we should have
one enum variant per physical type, i.e.
```rust
enum ScalarValue {
Int32(Option<i32>, DataType), // int32, date32, time32
Int64(Option<i64>, DataType), // int64, date64, time64, timestamp
List(Option<Vec<ScalarValue>>, DataType),
}
```
The `Option` indicates validity of the type, `DataType` represents the
logical type.
This corresponds to the notion that each variant needs to be treated
fundamentally different because its physical layout (i.e. at the machine level)
is different. This is how Rust handles these things with generics, because it
relies on type information to compile the instructions.
This would allow us to write our numerical operations neatly using a generic
over `ArrowNativeType`, as `T::Native` would have a one-to-one correspondence
to the physical type in the enum.
We still have a problem over which a `PrimitiveArray` currently needs to be
downcasted to its logical - not physical - representation, and thus we still
have a lot of useless downcastings. Specifically, `PrimitiveArray<Int32Type>`
and `PrimitiveArray<Date32Type>` have the same in-memory representation, their
only difference is a constant value at `ArrayData::DataType`; there is no
reason to have 2 variants here: we just need one variant and two values (the
datatypes). [Point
10](https://github.com/jorgecarleitao/arrow2/tree/proposal#10-make-primitivearrayt-nativetype-instead-of-primitivetype)
of my proposal addresses this by splitting the logical and physical. I.e. make
it `PrimitiveArray<i32>`. Regardless, this is just to explain why imo we should
aim at one variant per physical, not logical type.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]