sjperkins commented on pull request #10565:
URL: https://github.com/apache/arrow/pull/10565#issuecomment-1000166824


   > With some changes, I was able to get your branch to work and output a complex array. It is still preliminary. Let me know if you want to see it.
   
   Thanks, yes I would like to take a look.
    
   > I think the best way forward is to let the CPP handle the extension types.
   > 
   >     1. Don't strip the Extension information on the Cython side.
   > 
   >     2. The CPP could handle future extension types, each with its own potential `Writer`. Instead of stripping the Extension information on the Cython side, let the CPP handle it (e.g. if the extension is known, use its `Writer`; if it isn't, use the storage). This will require some rewrite on the CPP side, as the extension type is used in the conversion code, but I don't understand when it applies. To me this looks like the least intrusive option.
   
   Yes, I agree that this is the way forward. To mirror what you've described back to you: this would necessitate API changes to the C++ Extension Framework to support custom `Writer`s, correct?
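   
   To make sure we're talking about the same thing, here's a rough Python-level sketch of the dispatch I understand you to be proposing. The real change would live in the C++ writer path, and `known_extension_writers`/`default_writer` below are purely illustrative names:
   
   ```python
   import pyarrow as pa
   
   # Hypothetical registry mapping extension names to dedicated Writers.
   # Purely illustrative; the real mechanism would live in the C++ layer.
   known_extension_writers = {}
   
   def default_writer(array):
       # Stand-in for the existing storage-based writer path.
       return f"default writer for {array.type}"
   
   def writer_for(array):
       # If the extension is known, use its Writer; if it isn't, fall
       # back to the storage array, as proposed above.
       if isinstance(array.type, pa.ExtensionType):
           writer = known_extension_writers.get(array.type.extension_name)
           if writer is not None:
               return writer(array)
           return writer_for(array.storage)
       return default_writer(array)
   ```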
   
   > Calling `to_numpy` on a `FixedSizeListArray` generates a numpy array of pyarrow arrays. If it were instead to yield a 2D numpy array (which is directly mappable), there would be little need for the extension side to be handled at all. I guess this was done to ensure that the columns could be read into pandas, but when calling `to_numpy` I would expect to get a 2D array.
   > 
   > As a separate issue, I tried handling complex arrays from the Python side using only extension arrays. It mostly works, but the extension arrays don't play nice with `ChunkedArray`s:
   > 
   > ```python
   > import pyarrow as pa
   > import pyarrow.parquet as pq
   > import numpy as np
   > 
   > class ComplexArray(pa.ExtensionArray):
   >     def to_numpy(self):
   >         array = self.storage.flatten().to_numpy(zero_copy_only=True).view('complex128')
   >         return array
   > 
   > class ComplexType(pa.ExtensionType):
   >     def __init__(self):
   >         pa.ExtensionType.__init__(self, pa.list_(pa.float64(), 2), "arrow.complex128")
   >     def __arrow_ext_serialize__(self):
   >         # since we don't have a parameterized type, we don't need extra
   >         # metadata to be deserialized
   >         return b''
   >     @classmethod
   >     def __arrow_ext_deserialize__(cls, storage_type, serialized):
   >         # return an instance of this subclass given the serialized
   >         # metadata.
   >         return ComplexType()
   >     def __arrow_ext_class__(self):
   >         return ComplexArray
   > 
   > complex128 = ComplexType()
   > pa.register_extension_type(complex128)
   > 
   > complex_values = np.array([1.0+1.0j, 2.0+2.0j, 3.0+3.0j])
   > # View the complex values as interleaved float64 (real, imag) pairs
   > array = complex_values.view('float64')
   > float_array = pa.array(array, type=pa.float64())
   > list_array = pa.FixedSizeListArray.from_arrays(float_array, 2)
   > pa_array = ComplexArray.from_storage(complex128, list_array)
   > # This will yield the correct numpy array
   > assert (pa_array.to_numpy() == complex_values).all()
   > ```
   > 
   > It works, but we run into issues when writing to and reading from Parquet:
   > 
   > ```python
   > table = pa.Table.from_arrays([pa_array], names=['cpl'])
   > pq.write_table(table, "complex_test.parquet")
   > cpl_df = pq.read_table("complex_test.parquet")
   > # This line will create a numpy array of pyarrow arrays
   > cpl_df['cpl'].to_numpy()
   > # If you have a single chunk, this will work
   > assert (cpl_df['cpl'].chunks[0].to_numpy() == complex_values).all()
   > 
   > # Otherwise the solution is more involved
   > start_length = 0
   > end_length = 0
   > max_length = cpl_df['cpl'].length()
   > final_numpy = np.zeros(shape=(max_length), dtype=np.complex128)
   > for chunk in cpl_df['cpl'].iterchunks():
   >     temp_numpy = chunk.to_numpy()
   >     end_length += temp_numpy.shape[0]
   >     final_numpy[start_length:end_length] = temp_numpy
   >     start_length = end_length
   > assert (final_numpy == complex_values).all()
   > ```
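   
   On the `FixedSizeListArray` point: when there are no nulls, one can already get the 2D array today by flattening to the child array and reshaping. A minimal sketch, assuming zero-copy conversion is possible:
   
   ```python
   import pyarrow as pa
   
   values = pa.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0], type=pa.float64())
   fsl = pa.FixedSizeListArray.from_arrays(values, 2)
   
   # Flatten to the float64 child array and reshape to (n_rows, list_size).
   # Assumes no nulls, so the zero-copy conversion is valid.
   two_d = fsl.flatten().to_numpy(zero_copy_only=True).reshape(-1, 2)
   assert two_d.shape == (3, 2)
   ```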
   
   It does work quite nicely in Python. In fact, I implemented something similar in a repository I'm working on: https://github.com/ska-sa/dask-ms/blob/55c987e5f00c24a82c16363f681a966e45590e06/daskms/experimental/arrow/extension_types.py#L145-L148. IIRC I ran into similar issues with chunked Extension Arrays.
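   
   For what it's worth, the manual loop over chunks in your example can be collapsed with `np.concatenate`, though each chunk is still copied. A sketch, assuming the snippets above have been run:
   
   ```python
   import numpy as np
   
   # Convert each extension-array chunk and stitch the results together.
   final_numpy = np.concatenate(
       [chunk.to_numpy() for chunk in cpl_df['cpl'].iterchunks()])
   assert (final_numpy == complex_values).all()
   ```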
   
   Do you have an idea as to how this should be handled in the C++ layer?

