paleolimbot commented on issue #39689:
URL: https://github.com/apache/arrow/issues/39689#issuecomment-1942003467

   I just noticed that `__arrow_c_schema__` was missing when working on 
https://github.com/apache/arrow/pull/39985 . This is an interesting read, but I 
do think that adding `__arrow_c_schema__` will be beneficial.
   
   One of the problems is that there are two reasons you might want to call 
`obj.__arrow_c_schema__()`, which have been discussed above: either `obj` *is* 
a data type-like object (e.g., a `pyarrow.DataType`, a `nanoarrow.Schema`, or a 
`numpy.dtype`), or `obj` *has* a data type (e.g., a `pyarrow.Array`, a 
`pandas.Series`, or a `numpy.ndarray`).
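   For illustration, both shapes can expose the same method, which is why the 
distinction matters. Here is a minimal sketch with made-up stand-in classes 
(plain strings stand in for the real `PyCapsule` objects the protocol uses):

   ```python
   class MyDataType:
       """Type-like: *is* a data type."""
       def __arrow_c_schema__(self):
           return "<ArrowSchema capsule>"  # stand-in for a real PyCapsule

   class MyArray:
       """Array-like: *has* a data type, plus data."""
       def __arrow_c_schema__(self):
           return "<ArrowSchema capsule>"
       def __arrow_c_array__(self, requested_schema=None):
           return ("<ArrowSchema capsule>", "<ArrowArray capsule>")

   # Both expose __arrow_c_schema__, so that method alone can't tell them apart:
   print(hasattr(MyDataType(), "__arrow_c_schema__"))  # True
   print(hasattr(MyArray(), "__arrow_c_schema__"))     # True
   ```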
   
   You might want to use the second version if you are a consumer that doesn't 
understand one of the new types that were just added to the spec and doesn't 
have the ability to cast. For example:
   
   ```python
   import nanoarrow

   def split_lines(array):
     schema_src = array.__arrow_c_schema__()
     if nanoarrow.c_schema_view(schema_src).type == "string_view":
       schema_src, array_src = array.__arrow_c_array__(
           requested_schema=nanoarrow.string()
       )
     else:
       schema_src, array_src = array.__arrow_c_array__()

     if nanoarrow.c_schema_view(schema_src).type != "string":
       raise TypeError("array must be string or string_view")
   ```
   
   In that case, you really do need the ability to get the data type from the 
producer in the event that you have to request something else. This type of 
negotiation is (in my view) far superior to maintaining a spec for keyword 
arguments to `__arrow_c_array__()` that would help simple consumers get Arrow 
data they understand (while freeing producers to take advantage of newer/higher 
performance types without worrying about compatibility).
   
   You might want to use the first one if you have a function like:
   
   ```python
   def cast(array, schema):
     schema_dst = schema.__arrow_c_schema__()
     schema_src, array_src = array.__arrow_c_array__()
     # ...do some casting stuff, maybe in C
   ```
   
   Here, it would be very strange if you could pass a `pyarrow.Array` as the 
`schema` argument without an error. I think this can be disambiguated by 
checking `hasattr(obj, "__arrow_c_array__") or hasattr(obj, 
"__arrow_c_stream__")`:
   
   
   ```python
   def cast(array, schema):
     if hasattr(schema, "__arrow_c_array__") or hasattr(schema, "__arrow_c_stream__"):
       raise TypeError("Can't pass array-like object as schema")

     schema_dst = schema.__arrow_c_schema__()
     schema_src, array_src = array.__arrow_c_array__()
     # ...do some casting stuff, maybe in C
   ```
   
   I will probably bake this into `nanoarrow.c_schema()`, perhaps using 
another argument or another function to enable the case where you *do* want the 
data type from something that is array-like.
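   A sketch of what that disambiguation could look like (the `allow_array_like` 
argument and the fake classes below are hypothetical, not actual nanoarrow API; 
strings stand in for the real `PyCapsule` objects):

   ```python
   def c_schema(obj, allow_array_like=False):
       # An object that produces arrays or streams is assumed to *have* a type
       # rather than *be* one; reject it unless the caller opts in.
       if not allow_array_like and (
           hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
       ):
           raise TypeError("Can't interpret array-like object as a schema")
       return obj.__arrow_c_schema__()

   class FakeType:
       def __arrow_c_schema__(self):
           return "<schema capsule>"  # stand-in for a real PyCapsule

   class FakeArray:
       def __arrow_c_schema__(self):
           return "<schema capsule>"
       def __arrow_c_array__(self, requested_schema=None):
           return ("<schema capsule>", "<array capsule>")

   print(c_schema(FakeType()))                          # accepted: type-like
   print(c_schema(FakeArray(), allow_array_like=True))  # accepted: explicit opt-in
   ```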

