rok commented on issue #12553:
URL: https://github.com/apache/arrow/issues/12553#issuecomment-1132855049

   Thanks for the background info, @madhavajay!
   
   > From a tabular perspective the data is essentially a row for each data 
subject (of whom we are protecting their privacy), like 1 large n-dim tensor, 
2x similar ndim tensors (min and max providing bounds for the DP algorithms)
   
   That sounds like a good fit for TensorArray, as proposed in #8510 for C++ or 
as implemented in Python 
[here](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/tensor.py#L282).
   
   > potentially with a lot of repeated data (so we made a custom datatype 
called lazyrepeatarray which removes duplicate dimensions)
   
   Given that you have duplicate dimensions, would [CSF sparse 
tensors](https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/python/pyarrow/tests/test_sparse_tensor.py#L247-L271)
 help? Unlike CSR and CSC, CSF supports n-dimensional tensors. The main 
benefit would be reduced size for transport and storage, since you'd still 
need dense forms for most computation, although some aggregate functions can 
be applied to sparse formats too.
   
   > However if there is no ability to do computation with pyarrow on that data 
then we just need to take it back out anyway. Currently were doing things like 
aggregate sum operations etc but we are implementing the entire suite of ops 
required for DL so we need numpy style flexibility.
   
   NumPy does indeed seem the safest option. However, as David mentioned, there 
is ongoing work on Python UDFs; [existing features are tested 
here](https://github.com/vibhatha/arrow/blob/master/python/pyarrow/tests/test_udf.py).

