sjperkins commented on issue #16264: URL: https://github.com/apache/arrow/issues/16264#issuecomment-2312962029
> @sjperkins looking at the past discussion I agree extension type is the way to go. I'm not sure about the storage type though. It would be nice to have zero-copy path to numpy which, as per [this](https://github.com/apache/arrow/issues/39753#issuecomment-1908608726), would need fixed_size_list approach.
>
> /cc @maupardh1 who started #39754

I don't think using `FixedSizeBinary` precludes zero-copy. The following, similar to @zeroshade's suggestion in https://github.com/apache/arrow/issues/39753#issuecomment-1908608726, seems to work (although I realise more work would be needed at the Cython layer for `pa.array` to accept `np.array(..., np.complex64)`):

```python
import numpy as np
import pyarrow as pa
from numpy.testing import assert_array_equal

COMPLEX64_STORAGE_TYPE = pa.binary(8)


class ComplexFloatExtensionType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, COMPLEX64_STORAGE_TYPE, 'complex64')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized_data):
        return ComplexFloatExtensionType()

    def wrap_array(self, storage_array):
        return pa.ExtensionArray.from_storage(self, storage_array)

    def __arrow_ext_class__(self):
        return ComplexFloatExtensionArray


class ComplexFloatExtensionArray(pa.ExtensionArray):
    @classmethod
    def from_numpy(cls, array):
        if array.dtype != np.complex64:
            raise ValueError("Only complex64 dtype is supported")

        storage_array = pa.FixedSizeBinaryArray.from_buffers(
            COMPLEX64_STORAGE_TYPE,
            len(array),
            [None, pa.py_buffer(array.view(np.uint8))],
        )

        return pa.ExtensionArray.from_storage(
            ComplexFloatExtensionType(), storage_array)

    def to_numpy(self, zero_copy_only=True, writeable=False):
        return np.frombuffer(self.storage.buffers()[1], dtype=np.complex64)


# Register the extension type with Arrow
pa.register_extension_type(ComplexFloatExtensionType())

data = np.array([1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j], dtype=np.complex64)
arrow_data = ComplexFloatExtensionArray.from_numpy(data)

# Arrow buffers use the NumPy buffer
assert arrow_data.storage.buffers()[1].address == data.view(np.uint8).ctypes.data

roundtrip = arrow_data.to_numpy()
assert_array_equal(roundtrip, data)
# Final array uses the original array's buffers
assert roundtrip.ctypes.data == data.ctypes.data

data2 = pa.array(data, type=ComplexFloatExtensionType())
assert_array_equal(data, data2)

tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)

print(arrow_data)
print(tensor)
```

One downside might be that there isn't yet support for custom extension type output (https://github.com/apache/arrow/issues/36648), so at the C++ Arrow layer the developer would be looking at a bunch of binary data. But it's an extension type, and it wouldn't preclude developers from interpreting the fixed-width buffers as `std::complex<float>` via the `data_as`/`mutable_data_as` methods.

> That said, wouldn't it be possible to interpret Fixed/VariableShapeTensor as a complex tensor if an extra dimension of size 2 was added (and strides were done correctly, namely the complex dimension had the smallest stride)? I think the memory layout in this case would match numpy's. FixedShapeTensor can be zero-copied to numpy today and can probably be cast to complex in numpy. It would of course be better to add a new extension type to not make users do this manually.
Yes, this should work. I've used something similar with nested FixedSizeListArrays to represent complex arrays whose underlying buffers can simply be passed to the appropriate NumPy method.

However, would this not create the need to special-case a lot of type handling? i.e. there may need to be:

1. A basic ComplexFloat + ComplexDouble
2. A FixedShapeTensor and a ComplexFixedShapeTensor (or possibly indicate complex in the serialized metadata?)
3. Same for VariableShapeTensors
4. and maybe other compound types.

If the above are valid concerns, my bias is towards the `pa.binary(8/16)` storage type. It's fixed-width (so it works as a tensor type) and has the same binary format as `std::complex<float/double>` and `np.complex64/128`. To be fair, the underlying FixedSizeList values buffer also has the same binary format, but the list type is not fixed-width because nulls are possible.
