madhavajay commented on issue #12553:
URL: https://github.com/apache/arrow/issues/12553#issuecomment-1132319226

   @mrkn and @rok We are implementing a NumPy-like interface to large tensors 
with sidecar data to provide Differential Privacy and Secure Multi-Party 
Computation.
   
   In the end we decided to use the pyarrow Tensor format only for zero copy, 
and then we store that in Cap'n Proto (capnp) to send over the network, because 
we need additional metadata and protobuf messages max out at 2GB. We did try to 
shove it all into a single pyarrow record batch, but that didn't really work. 
Finally, for computation on the other end we're using jax, since it's basically 
DL Tensor numpy data.
   
   From a tabular perspective, the data is essentially one row for each data 
subject (the people whose privacy we are protecting). Each row holds one large 
n-dim tensor, two similar n-dim tensors (min and max, providing bounds for the 
DP algorithms) that potentially contain a lot of repeated data (so we made a 
custom datatype called lazyrepeatarray, which removes duplicate dimensions), 
and finally information about the data subject. In a perfect world that would 
all be one single record, but to make efficient use of the data and zero copy 
we need it all to be in a single array (column) for each data type rather than 
a record / row. However, if there is no ability to do computation with pyarrow 
on that data, then we just need to take it back out anyway. Currently we're 
doing things like aggregate `sum` operations, but we are implementing the 
entire suite of ops required for DL, so we need numpy-style flexibility.
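
   As a rough illustration of the repeated-data idea, the sketch below stores 
only the de-duplicated values plus the logical shape and leans on NumPy 
broadcasting. The class name `LazyRepeat` and its methods are hypothetical and 
are not our actual lazyrepeatarray implementation.

```python
import math

import numpy as np


class LazyRepeat:
    """Hypothetical sketch: small `data` logically repeated out to `shape`."""

    def __init__(self, data, shape):
        self.data = np.asarray(data)
        self.shape = tuple(shape)
        # Raises ValueError early if `data` cannot be broadcast to `shape`.
        np.broadcast_to(self.data, self.shape)

    def to_numpy(self):
        # Read-only broadcast view; the repeated values are not copied.
        return np.broadcast_to(self.data, self.shape)

    def sum(self):
        # Broadcasting repeats every stored element the same number of times,
        # so the full sum is the stored sum scaled by that repeat count.
        repeats = math.prod(self.shape) // self.data.size
        return self.data.sum() * repeats


# A scalar lower bound repeated over a large logical shape.
lower = LazyRepeat(0.0, (1000, 1000))
# One row of upper bounds repeated for four rows.
upper = LazyRepeat(np.array([[1, 2, 3]]), (4, 3))
upper_total = upper.sum()
```

   An aggregate like `sum` can then be answered from the stored values alone, 
without materialising the full tensor.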
   
   Perhaps we were simply not utilising pyarrow correctly to take best 
advantage of what is possible. Also, we are trying to avoid any `serde` that 
allows code paths supporting Python objects (aka `pickle`), due to known 
security vulnerabilities.
   
   I hope that helps.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

