My thoughts:

1. I've lost track of where the higher-level encryption implementation in C++ stands. I think we were trying to reach consensus on the threading/thread-safety model?
2. I'm open to exposing the lower-level encryption libraries in Python (with appropriate namespacing/communication). At least for reading, there seems to be less potential for harm (with the caveat that I'm not a security expert). Are both the low-level read and write implementations necessary? (It probably makes sense to have a few smaller PRs for exposing this functionality anyway.)

On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <ita...@pythonspeed.com> wrote:

> Hi,
>
> Since the PR for high-level C++ Parquet encryption API appears stalled
> (https://github.com/apache/arrow/pull/8023), I'm looking into exposing the
> low-level Parquet encryption API to Python.
>
> Arguments for doing this: the low-level API is all the users I'm talking
> to need, at the moment, so it's plausible others would also find some
> benefit in having the Pyarrow API expose low-level Parquet encryption. Then
> again, it might only be this one company and no one else cares.
>
> The arguments against, per Gidon Gershinsky:
>
> > * security: low-level encryption API is easy to misuse (eg giving the
> > same keys for a number of different files; this'd break the AES GCM
> > cipher). The high-level encryption layer handles that by applying envelope
> > encryption and other best practices in data security. Also, this layer is
> > maintained by the community, meaning that future improvements and security
> > fixes can be upstreamed by anyone, and available to all.
> >
> > * compatibility: parquet-mr implements the high-level encryption layer.
> > If we want the files produced by Spark/Presto/etc to be readable by
> > pandas/PyArrow (and vice versa), we need to provide the Arrow users with
> > the high-level API.
> >
> > ...
> >
> > The current situation is not ideal, it'd be good to merge the high-level
> > PR (and maybe hide the low level), but here we are; also, C++ is a kind of
> > a low-level language; Python would expose it to a less experienced audience.
>
> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
>
> I find the compatibility argument less compelling, that's readily
> addressed by documentation. I am not a crypto expert so I can't evaluate
> how risky exposing the low-level encryption APIs would be, but I can see
> how that would be a significant concern.
>
> Some options are:
> * Status quo, no Python API for low-level Parquet encryption. This has
>   two possible outcomes:
>   * Eventually high-level API gets merged, gets Python binding.
>   * High-level encryption API is never merged, Python users never get
>     access to encryption.
> * Add low-level Parquet encryption API to Pyarrow, perhaps using "hazmat"
>   idiom used by the Python cryptography package (API namespace indicating
>   "use at your own risk, this is dangerous", basically, e.g.
>   `cryptography.hazmat.primitives.ciphers.aead.ChaCha20Poly1305`).
>   * Gidon Gershinsky did not find this suggestion compelling enough to
>     override his security concerns.
> * Low-level encryption done as third party Python package, either private
>   or open source. This is annoying technically, plausibly would require
>   maintaining a fork.
>
> Any other ideas? Thoughts on these options?
>
> —Itamar
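
To make the key-reuse concern quoted above a bit more concrete, here is a minimal sketch of why counter-mode ciphers (AES GCM encrypts in counter mode) are fragile when a key is shared across files. It is not AES GCM and not any Parquet or Pyarrow API: `toy_keystream` and `toy_encrypt` are hypothetical stand-ins built on SHA-256 so the example runs with the standard library only. It shows the worst case, where the same key and the same nonce end up protecting two files. As far as I understand it, reusing one key across many files with random nonces makes such a collision more likely over time, and that is one of the things a per-file key (envelope encryption) avoids.

```python
import hashlib


def toy_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream: SHA-256(key || nonce || counter) blocks.

    Assumption: this only mimics the structure of a counter-mode cipher;
    it is not AES GCM and must not be used for real encryption.
    """
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Encrypt by XORing the plaintext with the keystream."""
    return xor_bytes(plaintext, toy_keystream(key, nonce, len(plaintext)))


# Two "files" encrypted under the same key AND the same nonce.
key = b"k" * 16
nonce = b"n" * 12
file_a = b"salary=100000;id=1"
file_b = b"salary=250000;id=2"

ct_a = toy_encrypt(key, nonce, file_a)
ct_b = toy_encrypt(key, nonce, file_b)

# The keystream cancels out: an observer who sees only the two ciphertexts
# learns the XOR of the two plaintexts without knowing the key.
leak = xor_bytes(ct_a, ct_b)

# If one plaintext is known or guessable, the other follows immediately.
recovered_b = xor_bytes(leak, file_a)

# With a distinct nonce per file the keystreams differ and nothing cancels.
ct_b_fresh = toy_encrypt(key, b"m" * 12, file_b)
```

Here `leak` equals `file_a XOR file_b`, so knowing one file's contents is enough to recover the other. A high-level layer that generates a fresh data key per file removes the need for callers to get this right by hand.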