jmdeschenes commented on pull request #10565:
URL: https://github.com/apache/arrow/pull/10565#issuecomment-999702255
Thanks for the update.
With some changes, I was able to get your branch to work and output a
complex array. It is still preliminary; let me know if you want to see it.
I think the best way forward is to let the CPP side handle the extension types:
1. Don't strip the extension information on the Cython side.
2. Let CPP handle future extension types, each with its own potential
`Writer`. Instead of stripping the extension information on the Cython side,
let CPP dispatch on it (e.g. if the extension is known, use its `Writer`; if it
isn't, fall back to the storage). This will require some rewriting on the CPP
side, as the extension type is used in the conversion code, but I don't
understand when that applies. To me this looks like the least intrusive option.
`fixed_size_list` arrays currently generate a numpy array of pyarrow arrays. If
`fixed_size_list` were instead to yield a 2D numpy array (which is directly
mappable), there would be little need for the extension side to be handled at
all. I guess this was done to ensure that the columns could be read into
pandas. When calling `to_numpy`, I would expect to get a 2D array.
As a separate issue, I tried handling complex arrays from the Python side
using only extension arrays. It mostly works, but the extension arrays don't
play nicely with ChunkedArrays:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np


class ComplexArray(pa.ExtensionArray):
    def to_numpy(self):
        return self.storage.flatten().to_numpy(zero_copy_only=True).view('complex128')


class ComplexType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.list_(pa.float64(), 2), "arrow.complex128")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized metadata
        return ComplexType()

    def __arrow_ext_class__(self):
        return ComplexArray


complex128 = ComplexType()
pa.register_extension_type(complex128)

array = np.array([1.0 + 1.0j, 2.0 + 2.0j, 3.0 + 3.0j])
float_array = pa.array(array.view('float64'), type=pa.float64())
list_array = pa.FixedSizeListArray.from_arrays(float_array, 2)
pa_array = ComplexArray.from_storage(complex128, list_array)

# This will yield the correct numpy array
assert (pa_array.to_numpy() == array).all()
```
It works, but we get into issues when writing/reading from Parquet:
```python
table = pa.Table.from_arrays([pa_array], names=['cpl'])
pq.write_table(table, "complex_test.parquet")
cpl_df = pq.read_table("complex_test.parquet")

# This line will create a numpy array of pyarrow arrays
cpl_df['cpl'].to_numpy()

# If you have a single chunk, this will work
assert (cpl_df['cpl'].chunks[0].to_numpy() == array).all()

# Otherwise the solution is more involved
start_length = 0
end_length = 0
max_length = cpl_df['cpl'].length()
final_numpy = np.zeros(shape=max_length, dtype=np.complex128)
for chunk in cpl_df['cpl'].iterchunks():
    temp_numpy = chunk.to_numpy()
    end_length += temp_numpy.shape[0]
    final_numpy[start_length:end_length] = temp_numpy
    start_length = end_length
assert (final_numpy == array).all()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]