madhavajay opened a new issue #12553:
URL: https://github.com/apache/arrow/issues/12553


   Hi,
   I have read through the docs and issues as best I can, and I am under the 
impression that it's not possible to run compute functions on nested arrays.
   
   I modified a group_by & aggregate example like so, putting the pa.array 
values into nested lists.
   ```
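   import pyarrow as pa

   t = pa.table([
         pa.array(["a", "a", "b", "b", "c"]),
         pa.array([[1], [2], [3], [4], [5]]),  # each value is a one-element list
   ], names=["keys", "values"])
   
   # Raises ArrowNotImplementedError because "values" is a list column
   t.group_by("keys").aggregate([("values", "sum")])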
   ```
   
   The error is this:
   ```
   ArrowNotImplementedError: Function 'hash_sum' has no kernel matching input 
types (array[list<item: int64>], array[uint32])
   ```
   
   I assume this means the function doesn't know how to operate on a list. Is 
there a way to do this? I have large tensors which I can reshape into 1 
dimension to store in a Record Batch, but I don't know how I can perform 
computations on their values. The alternative seems to be the Tensor type, 
but it can't be used in a Record Batch or with compute, can it?
   
   PyArrow's zero-copy conversion from NumPy makes this an effective way to get 
data across the network using the IPC writer, and it's fairly easy to add other 
record types for custom metadata. It would be a pity to then have to send this 
data back to NumPy for all my computations and lose out on all that great SIMD 
parallelization.
   
   Is there a better way?
   
   Related links:
   https://github.com/apache/arrow/issues/4802
   https://issues.apache.org/jira/browse/ARROW-1614


