sjperkins commented on issue #16264: URL: https://github.com/apache/arrow/issues/16264#issuecomment-2312962029
> @sjperkins looking at the past discussion I agree extension type is the way to go. I'm not sure about the storage type though. It would be nice to have zero-copy path to numpy which, as per [this](https://github.com/apache/arrow/issues/39753#issuecomment-1908608726), would need fixed_size_list approach.
>
> /cc @maupardh1 who started #39754

I don't think using `FixedSizeBinary` precludes zero-copy. The following, similar to @zeroshade's suggestion in https://github.com/apache/arrow/issues/39753#issuecomment-1908608726, seems to work (although I realise more work would be needed at the Cython layer for `pa.array` to accept `np.array(..., np.complex64)`):

```python
import numpy as np
import pyarrow as pa
from numpy.testing import assert_array_equal

COMPLEX64_STORAGE_TYPE = pa.binary(8)


class ComplexFloatExtensionType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, COMPLEX64_STORAGE_TYPE, 'complex64')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized_data):
        return ComplexFloatExtensionType()

    def wrap_array(self, storage_array):
        return pa.ExtensionArray.from_storage(self, storage_array)

    def __arrow_ext_class__(self):
        return ComplexFloatExtensionArray


class ComplexFloatExtensionArray(pa.ExtensionArray):
    @classmethod
    def from_numpy(cls, array):
        if array.dtype != np.complex64:
            raise ValueError("Only complex64 dtype is supported")

        storage_array = pa.FixedSizeBinaryArray.from_buffers(
            COMPLEX64_STORAGE_TYPE,
            len(array),
            [None, pa.py_buffer(array.view(np.uint8))],
        )

        return pa.ExtensionArray.from_storage(
            ComplexFloatExtensionType(), storage_array)

    def to_numpy(self, zero_copy_only=True, writeable=False):
        return np.frombuffer(self.storage.buffers()[1], dtype=np.complex64)


# Register the extension type with Arrow
pa.register_extension_type(ComplexFloatExtensionType())

data = np.array([1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j], dtype=np.complex64)
arrow_data = ComplexFloatExtensionArray.from_numpy(data)

# Arrow buffers use the NumPy buffer
assert arrow_data.storage.buffers()[1].address == data.view(np.uint8).ctypes.data

roundtrip = arrow_data.to_numpy()
assert_array_equal(roundtrip, data)
# Final array uses the original array's buffers
assert roundtrip.ctypes.data == data.ctypes.data

data2 = pa.array(data, type=ComplexFloatExtensionType())
assert_array_equal(data, data2)

tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)

print(arrow_data)
print(tensor)
```

One downside might be that there isn't yet support for custom extension type output (https://github.com/apache/arrow/issues/36648), so at the C++ Arrow layer the developer would be looking at a bunch of binary data. But it's an extension type, and it wouldn't preclude developers from interpreting the fixed-width buffers as `std::complex<float>` via the `data_as`/`mutable_data_as` methods.

> That said, wouldn't it be possible to interpret Fixed/VariableShapeTensor as a complex tensor if an extra dimension of size 2 was added (and strides were done correctly, namely the complex dimension had the smallest stride)? I think the memory layout in this case would match numpy's. FixedShapeTensor can be zero-copied to numpy today and can probably be cast to complex in numpy. It would of course be better to add a new extension type to not make users do this manually.
Yes, this should work. I've used something similar with nested FixedSizeListArrays to represent complex arrays whose underlying buffers can simply be passed to the appropriate NumPy method.

However, would this not create the need to special-case a lot of type handling? i.e. there may need to be:

1. A basic ComplexFloat + ComplexDouble
2. A FixedShapeTensor and a ComplexFixedShapeTensor (or possibly indicate complex in the serialized metadata?)
3. Same for VariableShapeTensors
4. and maybe other compound types.

If the above are valid concerns, my bias is towards the `pa.binary(8/16)` storage type. It's fixed-width (so it works as a tensor type) and has the same binary format as `std::complex<float/double>` and `np.complex64/128`. To be fair, the underlying FixedSizeList values buffer also has the same binary format, but the list type is not fixed-width because nulls are possible.
