paleolimbot commented on issue #8730:
URL: https://github.com/apache/arrow-rs/issues/8730#issuecomment-3769906303

   > I presume the proposal is to add DataType::Extension(Box<DataType>, 
Box<dyn Any>)
   
   There are a few ways to do this, but yes: a DataType enum variant holding 
something dynamic (e.g., https://github.com/apache/arrow-rs/pull/7398 ).
   
   > If so how do these arrays get created?
   
   There are a number of places that have implemented Arrow extension types 
with a registry (Polars Arrow, Arrow C++, Arrow Go, Arrow Java, DuckDB, 
nanoarrow R, nanoarrow Python) to draw on to answer this question. Briefly, any 
operation that produces a schema or a record batch from something that isn't an 
arrow-rs object is responsible for providing a registry to resolve an extension 
type. In practice that works quite well (arrow-rs is great at propagating its 
own DataType objects).
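
   To make the registry idea concrete, here is a minimal, std-only sketch. None of these names are arrow-rs APIs: `ToyDataType`, `ExtensionType`, `ExtensionRegistry`, and `uuid_factory` are hypothetical stand-ins I made up for illustration. The only things taken from the Arrow spec are the `ARROW:extension:name` / `ARROW:extension:metadata` field metadata keys and the `arrow.uuid` canonical extension name.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

// Field metadata keys defined by the Arrow columnar format.
pub const EXT_NAME_KEY: &str = "ARROW:extension:name";
pub const EXT_METADATA_KEY: &str = "ARROW:extension:metadata";

/// Hypothetical trait for a dynamic extension type (NOT an arrow-rs API).
pub trait ExtensionType: Debug + Send + Sync {
    fn name(&self) -> &str;
}

/// Toy stand-in for a data type enum with a dynamic extension variant.
#[derive(Debug, Clone)]
pub enum ToyDataType {
    FixedSizeBinary(i32),
    Extension(Arc<dyn ExtensionType>, Box<ToyDataType>),
}

/// A factory receives the serialized extension metadata and may decline.
type Factory = Box<dyn Fn(&str) -> Option<Arc<dyn ExtensionType>> + Send + Sync>;

/// Hypothetical registry mapping extension names to factories.
#[derive(Default)]
pub struct ExtensionRegistry {
    factories: HashMap<String, Factory>,
}

impl ExtensionRegistry {
    pub fn register<F>(&mut self, name: &str, factory: F)
    where
        F: Fn(&str) -> Option<Arc<dyn ExtensionType>> + Send + Sync + 'static,
    {
        self.factories.insert(name.to_string(), Box::new(factory));
    }

    /// Called by anything that builds a type from a non-Rust source
    /// (IPC, FFI, file readers): storage type + field metadata in,
    /// resolved type out. Unknown extensions fall back to storage.
    pub fn resolve(
        &self,
        storage: ToyDataType,
        metadata: &HashMap<String, String>,
    ) -> ToyDataType {
        let name = match metadata.get(EXT_NAME_KEY) {
            Some(name) => name,
            None => return storage,
        };
        let serialized = metadata
            .get(EXT_METADATA_KEY)
            .map(String::as_str)
            .unwrap_or("");
        match self.factories.get(name).and_then(|f| f(serialized)) {
            Some(ext) => ToyDataType::Extension(ext, Box::new(storage)),
            None => storage,
        }
    }
}

/// Example extension type and factory (illustrative only).
#[derive(Debug)]
pub struct Uuid;

impl ExtensionType for Uuid {
    fn name(&self) -> &str {
        "arrow.uuid"
    }
}

pub fn uuid_factory(_serialized: &str) -> Option<Arc<dyn ExtensionType>> {
    Some(Arc::new(Uuid))
}
```

   In this sketch a reader would call `registry.register("arrow.uuid", uuid_factory)` once, then `registry.resolve(storage, &field_metadata)` at the boundary; everything downstream just propagates the resolved type.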
   
   The problem of metadata/type mismatch is not unique to a dynamic Extension 
data type and even exists today: any attempt to set metadata might obliterate 
type information that an application is relying on. In fact, a dynamic DataType 
extension is specifically designed to alleviate the most common version of that 
issue (which is dropping the extension type on every call to `.data_type()` or 
`.column()`).
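
   A toy model of that failure mode, again std-only and with hypothetical names (`MiniType`, `MiniField`, `MiniArray`, `field_from_array` are not arrow-rs types; the `Extension` variant is the thing being proposed, not something that exists today). It only assumes that the extension name lives in field metadata under the spec's `ARROW:extension:name` key and that an array exposes just its data type:

```rust
use std::collections::HashMap;

pub const EXT_NAME_KEY: &str = "ARROW:extension:name";

/// Toy data type. `Extension` is the hypothetical dynamic variant.
#[derive(Debug, Clone, PartialEq)]
pub enum MiniType {
    FixedSizeBinary(i32),
    Extension { name: String, storage: Box<MiniType> },
}

/// Toy field: a data type plus string metadata.
#[derive(Debug, Clone)]
pub struct MiniField {
    pub data_type: MiniType,
    pub metadata: HashMap<String, String>,
}

/// Toy array: it knows its data type but has no field metadata.
#[derive(Debug, Clone)]
pub struct MiniArray {
    data_type: MiniType,
}

impl MiniArray {
    pub fn new(data_type: MiniType) -> Self {
        Self { data_type }
    }

    pub fn data_type(&self) -> &MiniType {
        &self.data_type
    }
}

/// Rebuild a field from an array, as code that only sees the array must.
/// Nothing but the data type is available, so metadata starts out empty.
pub fn field_from_array(array: &MiniArray) -> MiniField {
    MiniField {
        data_type: array.data_type().clone(),
        metadata: HashMap::new(),
    }
}

/// Look for the extension name in the type first, then in the metadata.
pub fn extension_name(field: &MiniField) -> Option<&str> {
    match &field.data_type {
        MiniType::Extension { name, .. } => Some(name.as_str()),
        _ => field.metadata.get(EXT_NAME_KEY).map(String::as_str),
    }
}
```

   With the metadata-only encoding, `extension_name(&field_from_array(&array))` comes back `None` because the array never saw the metadata; with the extension carried in the type itself, the same round trip keeps the name.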
   
   > If lack of engagement is the issue
   
   I don't think it is a lack of interested people; it is that the two options 
mentioned above (rewrite the DataType/Field/Schema/Array/RecordBatch stack and 
related APIs) are either sufficiently disruptive that nobody wants to 
review them, or sufficiently hacky that nobody wants to implement them (perhaps 
just speaking for myself on this last point).
   
   > but how downstreams choose to glue this together is not prescribed by 
arrow-rs.
   
   Providing an optional dynamic DataType extension is not prescribing a 
mechanism to glue together extension handling; it is providing a tool to 
propagate type implementations that applications can choose to opt in to 
without writing a parallel DataType/Field/Schema/Array/RecordBatch stack. This 
certainly could be integrated with arrow-rs native APIs but I don't think any 
of us are suggesting that as a part of this proposal.

