tobixdev commented on issue #8730:
URL: https://github.com/apache/arrow-rs/issues/8730#issuecomment-3772204011

   I can understand both viewpoints. On the one hand, adding 
`DataType::Extension` would make it impossible to separate physical arrays from 
extension types in the type system. On the other hand, it may be the most 
direct way of getting extension types (and possibly the only way that will 
actually be implemented).
   
   Maybe I can propose another approach. Perhaps the underlying issue is that 
arrow-rs allows us to provide extension type information (in `Field`) but 
offers no type that conveys that an `ArrayRef` may carry additional metadata 
requiring a different interpretation. This makes it easy to just use the 
`ArrayRef` and ignore the metadata entirely.
   
   The central idea is that in arrow-rs, once an array has been associated with 
a `Field` (which potentially carries metadata), it is represented by a type 
`NativeOrExtensionArray` that encodes the possibility of metadata in the type 
system. As a result, any code that encounters a `NativeOrExtensionArray` must 
account for the possibility that the array has to be interpreted differently, 
whereas code that encounters a plain `ArrayRef` can interpret it as the 
underlying `DataType`. In my experience, `RecordBatch` would be the central 
point of use for this new type. This should make it clear to downstream users 
(e.g., DataFusion) that metadata needs to be preserved and that opting out of 
it requires an explicit method call.
   
   Here is a possibility:
   
   ```rust
   /// Represents either an array of arrow-native types (see [`DataType`]) or an array of an
   /// extension type.
   ///
   /// This type should be used in APIs where users of arrow-rs can store metadata that changes
   /// the interpretation of the associated physical array. For example, in a record batch, each
   /// array can be associated with metadata, and this metadata can drastically change the
   /// interpretation of the array. [`NativeOrExtensionArray::Extension`] reflects such a
   /// different interpretation. Kernels in arrow-rs are usually defined only for the physical
   /// [`ArrayRef`]. It is the responsibility of the downstream user to check whether it makes
   /// sense to unwrap an [`ExtensionArray`] and call a particular arrow-rs kernel.
   pub enum NativeOrExtensionArray {
       /// An array of arrow-native types.
       Physical(ArrayRef),
       /// An array of an extension type.
       Extension(ExtensionArray),
   }
   ```
   
   For example, the new `column` method on `RecordBatch`:
   
   ```rust
   /// Get a reference to a column's array by index.
   ///
   /// # Panics
   ///
   /// Panics if `index` is outside of `0..num_columns`.
   pub fn column(&self, index: usize) -> &NativeOrExtensionArray {
       &self.columns[index]
   }
   ```
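   To illustrate how this changes downstream code, here is a hedged sketch of a consumer reading a column. Since none of these types exist in arrow-rs yet, the sketch uses self-contained stand-ins (`ArrayRef` as an `Arc<Vec<i32>>`, a minimal `ExtensionArray`); only the shape of the `match` is the point:

   ```rust
   use std::sync::Arc;

   // Stand-ins so the sketch compiles on its own; in arrow-rs these would be
   // the real `ArrayRef`, the proposed `ExtensionArray`, and the enum above.
   type ArrayRef = Arc<Vec<i32>>;

   struct ExtensionArray {
       inner: ArrayRef,
       // Hypothetical field standing in for the extension type name,
       // e.g. "arrow.uuid".
       name: &'static str,
   }

   enum NativeOrExtensionArray {
       Physical(ArrayRef),
       Extension(ExtensionArray),
   }

   // An existing kernel, defined only on the physical array.
   fn sum_kernel(a: &ArrayRef) -> i32 {
       a.iter().sum()
   }

   // The type system forces downstream code to handle both cases.
   fn process_column(col: &NativeOrExtensionArray) -> i32 {
       match col {
           // No metadata: interpret directly via the underlying DataType.
           NativeOrExtensionArray::Physical(a) => sum_kernel(a),
           // Extension type: a real consumer would consult `e.name` before
           // deciding that unwrapping the physical array is meaningful;
           // this sketch just unwraps explicitly.
           NativeOrExtensionArray::Extension(e) => {
               let _ = e.name;
               sum_kernel(&e.inner)
           }
       }
   }
   ```

   The key property is that ignoring the extension interpretation is now a visible, explicit choice in the `match`, not a silent default.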
   
   The `ExtensionArray` can be seen as a wrapper around a physical `ArrayRef`. 
Again, the type system now conveys that the underlying `ArrayRef` may require a 
different interpretation, and downstream users can choose to ignore that by 
"unwrapping" the `ArrayRef`. Importantly, the `ExtensionArray` provides access 
to an arrow-rs trait `DynExtensionType`. This trait allows validating whether 
an `ExtensionArray` can be associated with a particular field (by checking that 
the extension type name and metadata match). Associating a plain `ArrayRef` 
with an extension type field would error, as would associating an 
`ExtensionArray` with a field that has no extension type.
   
   ```rust
   /// Represents an array of an extension type.
   pub struct ExtensionArray {
       /// The inner physical array.
       inner: ArrayRef,
       /// The extension type of this array.
       extension_type: Arc<dyn DynExtensionType>,
   }
   
   impl ExtensionArray {
       /// Creates a new [`ExtensionArray`], validating that the extension type is compatible
       /// with the inner array.
       pub fn new(inner: ArrayRef, extension_type: Arc<dyn DynExtensionType>) -> Result<Self, ArrowError> {
           // TODO: validate that the extension type is compatible with `inner`.
           Ok(Self {
               inner,
               extension_type,
           })
       }

       /// Returns the inner array of this extension array.
       pub fn inner(&self) -> &ArrayRef {
           &self.inner
       }

       /// Returns the inner array, consuming the [`ExtensionArray`].
       pub fn into_inner(self) -> ArrayRef {
           self.inner
       }

       /// Returns the extension type of this extension array. This allows users to access the
       /// extension type name and its metadata.
       ///
       /// The extension type can be used to validate whether a field (including its metadata)
       /// is compatible with this extension type. This can be used, for example, to validate
       /// the creation of a record batch.
       pub fn extension_type(&self) -> &Arc<dyn DynExtensionType> {
           &self.extension_type
       }
   }
   ```
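   To make the validation idea concrete, here is a hedged sketch of what the `DynExtensionType` trait could look like. The trait shape and the minimal `Field` stand-in are assumptions for illustration; only the `"ARROW:extension:name"` metadata key is taken from the Arrow specification:

   ```rust
   use std::collections::HashMap;

   // Minimal stand-in for arrow_schema::Field; the real Field stores
   // extension information in its metadata map.
   struct Field {
       metadata: HashMap<String, String>,
   }

   // Hypothetical shape of the trait: it exposes the extension name and can
   // check whether a field (including its metadata) is compatible.
   trait DynExtensionType {
       fn extension_name(&self) -> &str;

       fn is_compatible(&self, field: &Field) -> bool {
           field
               .metadata
               .get("ARROW:extension:name")
               .map(|n| n == self.extension_name())
               .unwrap_or(false)
       }
   }

   // Example implementation for a canonical extension type.
   struct UuidType;

   impl DynExtensionType for UuidType {
       fn extension_name(&self) -> &str {
           "arrow.uuid"
       }
   }
   ```

   A check like `is_compatible` is what `RecordBatch` construction could call to reject mismatched field/array pairings in both directions.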
   
   I am definitely not the most qualified person on arrow-rs and DataFusion, 
but I think that the problem mentioned at the beginning (arrow-rs allows 
specifying metadata but provides no types that capture this information) could 
lead to some of the downstream problems (e.g., forgetting to consider metadata 
at all).
   
   I know this is again a disruptive change, but it would be localized to the 
part of arrow-rs that associates an `ArrayRef` with a `Field`. Most kernels, 
which are only concerned with the physical `ArrayRef`, would not need to be 
adapted, and we could distinguish in the type system between "may be an 
extension type" and "cannot be an extension type" even for bare arrays.
   
   The effect on DataFusion would be bigger, but it is difficult for me to 
tell. Maybe we can confine it to the transformation from `ArrayRef` to 
`ColumnarValue` and roll out the new arrow-rs types piece by piece.
   
   Any thoughts on that? 

