paleolimbot commented on issue #39689: URL: https://github.com/apache/arrow/issues/39689#issuecomment-1942003467
I just noticed that `__arrow_c_schema__` was missing when working on https://github.com/apache/arrow/pull/39985 . This is an interesting read, but I do think that adding `__arrow_c_schema__` will be beneficial.

One of the problems is that there are two reasons you might want to call `obj.__arrow_c_schema__()`, both of which have been discussed above: either `obj` *is* a data type-like object (e.g., a `pyarrow.DataType`, a `nanoarrow.Schema`, or a `numpy.dtype`), or `obj` *has* a data type (e.g., a `pyarrow.Array`, a `pandas.Series`, or a `numpy.ndarray`).

You might want to use the second version if you are a consumer that doesn't understand one of the new types that were just added to the spec and doesn't have the ability to cast. For example:

```python
import nanoarrow


def split_lines(array):
    # Ask the producer for its data type before requesting any data
    schema_src = array.__arrow_c_schema__()
    if nanoarrow.c_schema_view(schema_src).type == "string_view":
        # We can't handle string_view here, so ask the producer for string instead
        schema_src, array_src = array.__arrow_c_array__(requested_schema=nanoarrow.string())
    else:
        schema_src, array_src = array.__arrow_c_array__()

    if nanoarrow.c_schema_view(schema_src).type != "string":
        raise TypeError("array must be string or string_view")
```

In that case, you really do need the ability to get the data type from the producer in the event you have to request something else. This type of negotiation is (in my view) far superior to maintaining a spec for keyword arguments to `__arrow_c_array__()`: it helps simple consumers get Arrow data they understand while freeing producers to take advantage of newer/higher performance types without worrying about compatibility.

You might want to use the first one if you have a function like:

```python
def cast(array, schema):
    schema_dst = schema.__arrow_c_schema__()
    schema_src, array_src = array.__arrow_c_array__()
    # ...do some casting stuff, maybe in C
```

Here, it would be very strange if you could pass a `pyarrow.Array` as the `schema` argument without an error.
I think this can be disambiguated by checking `hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")`:

```python
def cast(array, schema):
    # Reject array-like objects passed where a data type is expected
    if hasattr(schema, "__arrow_c_array__") or hasattr(schema, "__arrow_c_stream__"):
        raise TypeError("Can't pass array-like object as schema")

    schema_dst = schema.__arrow_c_schema__()
    schema_src, array_src = array.__arrow_c_array__()
    # ...do some casting stuff, maybe in C
```

I will probably bake this into `nanoarrow.c_schema()`, perhaps using another argument or another function to enable the case where you *do* want the data type from something that is array-like.
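As a rough illustration of the check described above, here is a minimal sketch of a helper that rejects array-like objects by default and has an opt-in for the case where the data type of an array-like object is wanted. The name `schema_capsule`, the `allow_array_like` argument, and the `FakeType`/`FakeArray` stand-ins are all hypothetical; this is not the actual `nanoarrow.c_schema()` implementation, and the strings returned here only stand in for real PyCapsules.

```python
def schema_capsule(obj, allow_array_like=False):
    """Return obj.__arrow_c_schema__(), rejecting array-like objects by default.

    Hypothetical helper for illustration; not part of nanoarrow or pyarrow.
    """
    is_array_like = hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
    if is_array_like and not allow_array_like:
        raise TypeError("Can't pass array-like object as schema")
    if not hasattr(obj, "__arrow_c_schema__"):
        raise TypeError(f"{type(obj).__name__} does not implement __arrow_c_schema__")
    return obj.__arrow_c_schema__()


# Stand-ins for a data type-like object and an array-like object.
# Real producers would return PyCapsules, not strings.
class FakeType:
    def __arrow_c_schema__(self):
        return "schema-capsule"


class FakeArray:
    def __arrow_c_schema__(self):
        return "schema-capsule"

    def __arrow_c_array__(self, requested_schema=None):
        return "schema-capsule", "array-capsule"


def rejects(obj):
    """Return True if schema_capsule() refuses obj with a TypeError."""
    try:
        schema_capsule(obj)
    except TypeError:
        return True
    return False
```

With these stand-ins, `schema_capsule(FakeType())` succeeds, `schema_capsule(FakeArray())` raises `TypeError`, and `schema_capsule(FakeArray(), allow_array_like=True)` returns the array's schema. Whether the opt-in is better expressed as an argument or as a separate function is left open, as in the comment above.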
