madhavajay commented on issue #12553: URL: https://github.com/apache/arrow/issues/12553#issuecomment-1132319226
@mrkn and @rok We are implementing a NumPy-like interface to large tensors with sidecar data to provide Differential Privacy and Secure Multi-Party Computation. In the end we decided to use only the pyarrow Tensor format for zero copy, and we then store that in Capnp to send over the network, because we need additional metadata and protobuf is limited to 2GB. We did try to shove it all into a single pyarrow record batch, but that didn't really work. Finally, for computation on the other end we're using jax, since it's basically DL Tensor numpy data.

From a tabular perspective, the data is essentially one row per data subject (the people whose privacy we are protecting): 1 large n-dim tensor, 2 similar n-dim tensors (min and max, providing bounds for the DP algorithms) that potentially contain a lot of repeated data (so we made a custom datatype called lazyrepeatarray which removes duplicate dimensions), and finally information about the data subject. In a perfect world that would all be one single record, but to make efficient use of the data and zero copy we need it all to be in a single array (column) for each data type rather than a record / row.

However, if there is no ability to do computation with pyarrow on that data, then we just need to take it back out anyway. Currently we're doing things like aggregate `sum` operations, but we are implementing the entire suite of ops required for DL, so we need numpy-style flexibility. Perhaps we were simply not utilising pyarrow correctly to take best advantage of what is possible.

Also, we are trying to avoid the `serde` which allows for code paths that support Python Objects, aka `pickle`, due to known security vulnerabilities.

I hope that helps.
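
To illustrate the repeated-bounds idea described above, here is a minimal NumPy sketch. It is not the actual lazyrepeatarray implementation: the class name `LazyRepeatArray` and its methods `to_numpy` and `sum` are hypothetical, chosen only for this example. The assumption is that the min/max bounds can be stored as a small array plus a logical shape, and expanded only as a broadcast view when a full-size array is needed.

```python
import numpy as np


class LazyRepeatArray:
    """Hypothetical sketch: one small array plus the full logical shape."""

    def __init__(self, data, shape):
        self.data = np.asarray(data)
        self.shape = tuple(shape)
        # Raises ValueError if the shapes are incompatible; the equality
        # check rejects data that would broadcast to a larger shape.
        if np.broadcast_shapes(self.data.shape, self.shape) != self.shape:
            raise ValueError("data cannot be repeated to the requested shape")

    def to_numpy(self):
        # broadcast_to returns a read-only view; repeated values are not copied.
        return np.broadcast_to(self.data, self.shape)

    def sum(self):
        # Each stored element appears (logical size / stored size) times,
        # so the aggregate can be computed without expanding the data.
        repeats = int(np.prod(self.shape)) // self.data.size
        return self.data.sum() * repeats


# A max bound that is identical for every row of a (4, 3) tensor.
upper = LazyRepeatArray([1.0, 2.0, 3.0], (4, 3))
assert upper.to_numpy().shape == (4, 3)
assert upper.sum() == upper.to_numpy().sum()
```

Only three values are stored here, while consumers that need a dense array (for example before handing data to jax) can still obtain a `(4, 3)` view through `to_numpy()`.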
