madhavajay opened a new issue #12553:
URL: https://github.com/apache/arrow/issues/12553
Hi,
I have read through the docs and issues as best I can, and I am under the
impression that it's not possible to run compute functions on nested arrays.
I modified a group_by & aggregate example like so, putting the pa.array
values into nested lists.
```
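import pyarrow as pa

# Same as the docs example, except each value is wrapped in a one-element list.
t = pa.table([
    pa.array(["a", "a", "b", "b", "c"]),
    pa.array([[1], [2], [3], [4], [5]]),
], names=["keys", "values"])
# Raises ArrowNotImplementedError: no 'hash_sum' kernel for list<item: int64>.
t.group_by("keys").aggregate([("values", "sum")])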
```
The error is this:
```
ArrowNotImplementedError: Function 'hash_sum' has no kernel matching input
types (array[list<item: int64>], array[uint32])
```
I assume this means the function doesn't know how to operate on a list? Is
there a way to do this? I have large tensors which I can reshape into 1
dimension to store in a Record Batch, but I don't know how I can perform
computations on their values. It seems like the other option is to use the
Tensor type, but it can't be used in a Record Batch or with compute, can it?
The PyArrow zero copy from NumPy means this is an effective way to get data
across the network using the IPC writer, and it's fairly easy to add other
record types for custom metadata, but it would be a pity to have to then send
this data back to NumPy for all my computations and lose out on all that great
SIMD parallelization.
Is there a better way?
Related links:
https://github.com/apache/arrow/issues/4802
https://issues.apache.org/jira/browse/ARROW-1614