jmdeschenes commented on pull request #10565:
URL: https://github.com/apache/arrow/pull/10565#issuecomment-999702255
Thanks for the update.
With some changes, I was able to get your branch to work and output a
complex array. It is still preliminary; let me know if you want to see it.
I think the best way forward is to let the CPP side handle the extension types:
1. Don't strip the extension information on the Cython side.
2. Let CPP handle future extension types, each with its own potential
`Writer`. Instead of stripping the extension information on the Cython side,
let CPP dispatch on it (e.g. if the extension is known, use its `Writer`; if it
isn't, fall back to the storage). This will require some rewriting on the CPP
side, as the extension type is used in the conversion code, but I don't
understand when that applies. To me this looks like the least intrusive option.
`fixed_size_list` arrays currently generate a numpy array of pyarrow arrays. If
`fixed_size_list` were instead to yield a 2D numpy array (which is directly
mappable), there would be little need for the extension side to be handled at
all. I guess this was done to ensure that the columns could be read into
pandas. When calling `to_numpy`, I would expect to get a 2D array.
As a separate issue, I tried handling complex arrays from the Python side
using only extension arrays. It mostly works, but the extension arrays don't
play nicely with ChunkedArrays:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np


class ComplexArray(pa.ExtensionArray):
    def to_numpy(self):
        return self.storage.flatten().to_numpy(zero_copy_only=True).view('complex128')


class ComplexType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.list_(pa.float64(), 2), "arrow.complex128")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized metadata
        return ComplexType()

    def __arrow_ext_class__(self):
        return ComplexArray


complex128 = ComplexType()
pa.register_extension_type(complex128)

array = np.array([1.0 + 1.0j, 2.0 + 2.0j, 3.0 + 3.0j])
float_array = pa.array(array.view('float64'), type=pa.float64())
list_array = pa.FixedSizeListArray.from_arrays(float_array, 2)
pa_array = ComplexArray.from_storage(complex128, list_array)

# This will yield the correct numpy array
assert (pa_array.to_numpy() == array).all()
```
It works, but we get into issues when writing/reading from Parquet:
```python
table = pa.Table.from_arrays([pa_array], names=['cpl'])
pq.write_table(table, "complex_test.parquet")
cpl_df = pq.read_table("complex_test.parquet")

# This line will create a numpy array of pyarrow arrays
cpl_df['cpl'].to_numpy()

# If you have a single chunk, this will work
assert (cpl_df['cpl'].chunks[0].to_numpy() == array).all()

# Otherwise the solution is more involved
start_length = 0
end_length = 0
max_length = cpl_df['cpl'].length()
final_numpy = np.zeros(shape=max_length, dtype=np.complex128)
for chunk in cpl_df['cpl'].iterchunks():
    temp_numpy = chunk.to_numpy()
    end_length += temp_numpy.shape[0]
    final_numpy[start_length:end_length] = temp_numpy
    start_length = end_length
assert (final_numpy == array).all()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]