jorgecarleitao commented on pull request #9454:
URL: https://github.com/apache/arrow/pull/9454#issuecomment-778036785


   @abreis , I am not convinced that that is sufficient, unfortunately, because 
it excludes all types that are not Numeric (i.e. all dates and times for 
primitives, as well as all other logical types).
   
   We could of course offer a `DatumTemporal` and a `DatumList` and so one and 
so forth, but IMO that is introducing more complexity, as we now start having 
`struct`s for all kinds of logical types.
   
   Generally, the logical operation performed on the data depends on its 
`DataType`. Thus, the scalar needs to contain that logical data type 
information.
   
   In my opinion, the data structure should be something like what DataFusion 
has, but instead of having one enum variant per logical type, we should have 
one enum variant per physical type, i.e.
   
   ```rust
   enum ScalarValue {
       Int32(Option<i32>, DataType),  // int32, date32, time32
       Int64(Option<i64>, DataType), // int64, date64, time64, timestamp
       List(Option<Vec<ScalarValue>>, DataType),
   }
   ```
   The `Option` indicates validity of the type, `DataType` represents the 
logical type.
   
   This corresponds to the notion that each variant needs to be treated 
fundamentally different because its physical layout (i.e. at the machine level) 
is different. This is how Rust handles these things with generics, because it 
relies on type information to compile the instructions.
   
   This would allow us to write our numerical operations neatly using a generic 
over `ArrowNativeType`, as `T::Native` would have a one-to-one correspondence 
to the physical type in the enum.
   
   We still have a problem over which a `PrimitiveArray` currently needs to be 
downcasted to its logical - not physical - representation, and thus we still 
have a lot of useless downcastings. Specifically, `PrimitiveArray<Int32Type>` 
and `PrimitiveArray<Date32Type>` have the same in-memory representation, their 
only difference is a constant value at `ArrayData::DataType`; there is no 
reason to have 2 variants here: we just need one variant and two values (the 
datatypes). [Point 
10](https://github.com/jorgecarleitao/arrow2/tree/proposal#10-make-primitivearrayt-nativetype-instead-of-primitivetype)
 of my proposal addresses this by splitting the logical and physical. I.e. make 
it `PrimitiveArray<i32>`. Regardless, this is just to explain why imo we should 
aim at one variant per physical, not logical type.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to