Hi Antoine, All comments have been handled. Can we ask you to shepherd this PR for the reminder of its lifecycle? (hopefully, most of this is already behind us). https://github.com/apache/arrow/pull/8023
Cheers, Gidon ---------- Forwarded message --------- From: Gidon Gershinsky <gg5...@gmail.com> Date: Thu, Feb 18, 2021 at 6:25 PM Subject: Re: Exposing low-level Parquet encryption to Python user (or, maybe not) To: dev <dev@arrow.apache.org> Thanks, then we'll just go ahead and address the remaining comments. Cheers, Gidon On Thu, Feb 18, 2021 at 5:45 PM Antoine Pitrou <anto...@python.org> wrote: > > I don't think there's any concern around having a process-global shared > key cache. The discussion was just around the implementation. > > Also, FTR, a standalone LRU cache class is proposed here, which may > reduce the amount of original code in the Parquet encryption PR: > https://github.com/apache/arrow/pull/8716 > > Best regards > > Antoine. > > > Le 18/02/2021 à 16:40, Gidon Gershinsky a écrit : > > I believe the shared structures that were debated are the key caches. > > > > Cheers, Gidon > > > > > > On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > >>> > >>> I don't think any notion of threading should be present in the > >>> implementation, except for the required locks around shared structures. > >> > >> > >> I seem to recall the debate was how to model some class interactions to > >> determine what should be considered shared structures and what should > not. > >> > >> On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com> > wrote: > >> > >>> This certainly sounds good to me. > >>> > >>> Cheers, Gidon > >>> > >>> > >>> On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org> > >> wrote: > >>> > >>>> > >>>> I don't think any notion of threading should be present in the > >>>> implementation, except for the required locks around shared > structures. > >>>> I don't know where the idea of a "main thread" comes from, but it > >>>> probably shouldn't exist in a C++ library. > >>>> > >>>> Regards > >>>> > >>>> Antoine. > >>>> > >>>> > >>>> > >>>> Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit : > >>>>> Just to clarify. There are two options, which one do you refer to? A > >>>> design > >>>>> with a main thread that handles projections and the keys (relevant > >> for > >>>> the > >>>>> projected columns); or the current code with any thread allowed to > >>> handle > >>>>> full file reading, inc the footer, column projections and their keys? > >>> Can > >>>>> you finalize this with Micah? > >>>>> The good news is, Tham is still interested to resume this work, and > >> is > >>> ok > >>>>> with either option. Please let her know whether the current threading > >>>> model > >>>>> stays, or should be modified with the changes proposed in the doc > >> (for > >>>> the > >>>>> latter, some guidance with the details would be needed). > >>>>> > >>>>> Cheers, Gidon > >>>>> > >>>>> > >>>>> On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> > >>>> wrote: > >>>>> > >>>>>> > >>>>>> > >>>>>> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit : > >>>>>>> From the doc, > >>>>>>> "To maintain consistency with the style of parquet-cpp, the above > >>>>>>> structures should not be explicitly synchronized with individual > >>>> mutexes. > >>>>>>> In the case of a parquet::arrow::FileReader, the request to read a > >>>> given > >>>>>>> selection of row groups and columns is issued from a single main > >>>> thread. > >>>>>>> Note that this does require that all keys required for a read are > >>>>>> assembled > >>>>>>> on the main thread so that DecryptionKeyRetriever objects are not > >>>>>> directly > >>>>>>> accessing any caches" > >>>>>>> > >>>>>>> The current PR code doesn't require a single main thread. Any > >> thread > >>>> can > >>>>>>> read any file, both footer and pages. So the key cache is shared, > >> to > >>>> save > >>>>>>> N-1 interactions with the KMS server. > >>>>>> > >>>>>> I don't think there's any contention on this. IMHO the only > >> concerns > >>>>>> are about the implementation, not the semantics. > >>>>>> > >>>>>> Best regards > >>>>>> > >>>>>> Antoine. > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> Cheers, Gidon > >>>>>>> > >>>>>>> > >>>>>>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou < > >> anto...@python.org> > >>>>>> wrote: > >>>>>>> > >>>>>>>> > >>>>>>>> I'm not sure a threading model is expected for an encryption > >> layer. > >>>> Am > >>>>>>>> I missing something? > >>>>>>>> > >>>>>>>> Regards > >>>>>>>> > >>>>>>>> Antoine. > >>>>>>>> > >>>>>>>> > >>>>>>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : > >>>>>>>>> Precisely, the main change is in the threading model. Afaik, the > >>>>>> document > >>>>>>>>> proposes a model that fits pandas, but might be problematic for > >>> other > >>>>>>>> users > >>>>>>>>> of this library. > >>>>>>>>> Technically, this is not showstopper though; if the community > >>> decides > >>>>>> on > >>>>>>>>> this model, it will be compatible with the high-level encryption > >>>>>> design; > >>>>>>>>> but the change implementation would need to be done by pandas > >>> experts > >>>>>>>> (not > >>>>>>>>> us; but we'll help where we can). > >>>>>>>>> Micah, you know this subject (and the community) better than we > >> do > >>> - > >>>>>> we'd > >>>>>>>>> much appreciate it if you'd take a lead on removing this > >> roadblock. > >>>>>>>>> > >>>>>>>>> Cheers, Gidon > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield < > >>>> emkornfi...@gmail.com > >>>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> I think some of the comments might be conflicting. One of the > >>>>>> concerns > >>>>>>>>>> (that I would need to refresh myself on to offer an opinion > >> which > >>>> was > >>>>>>>>>> covered in Ben's doc) was the threading model we expect in the > >>>>>> library. > >>>>>>>>>> > >>>>>>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou < > >>> anto...@python.org> > >>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Hi Gidon, > >>>>>>>>>>> > >>>>>>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > >>>>>>>>>>>> Regarding the high-level layer, I think it waits for a > >> progress > >>> at > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>> > >>>> > >>> > >> > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > >>>>>>>>>>>> No activity there since last November. This is unfortunate, > >>>> because > >>>>>>>>>> Tham > >>>>>>>>>>>> has put a lot of work in coding the high-level layer (and > >>>> addressing > >>>>>>>>>> 200+ > >>>>>>>>>>>> review comments) in the PR > >>>>>> https://github.com/apache/arrow/pull/8023. > >>>>>>>>>>> The > >>>>>>>>>>>> code is functional, compatible with the Java version in > >>>> parquet-mr, > >>>>>>>> and > >>>>>>>>>>> can > >>>>>>>>>>>> be updated with the threading changes in the doc above. I hope > >>> all > >>>>>>>> this > >>>>>>>>>>>> good work will not be wasted. > >>>>>>>>>>> > >>>>>>>>>>> I'm sorry for the possibly frustrating process. Looking at the > >>> PR, > >>>>>>>>>>> though, it seems a bunch of comments were not addressed. Is it > >>>>>>>> possible > >>>>>>>>>>> to go through them and ensure they get an answer and/or a > >>>> resolution? > >>>>>>>>>>> > >>>>>>>>>>> Best regards > >>>>>>>>>>> > >>>>>>>>>>> Antoine. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Cheers, Gidon > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < > >>>>>>>> emkornfi...@gmail.com > >>>>>>>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> My thoughts: > >>>>>>>>>>>>> 1. I've lost track of the higher level encryption > >>> implementation > >>>>>> in > >>>>>>>>>>> C++. > >>>>>>>>>>>>> I think we were trying to come to a consensus on the > >>>>>> threading/thread > >>>>>>>>>>>>> safety model? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 2. I'm open to exposing the lower level encryption libraries > >>> in > >>>>>>>>>> python > >>>>>>>>>>>>> (without appropriate namespacing/communication). It seems at > >>>> least > >>>>>>>>>> for > >>>>>>>>>>>>> reading, there is potentially less harm (I'll caveat that > >> with > >>>> I'm > >>>>>>>>>> not a > >>>>>>>>>>>>> security expert). Are both the low level read and write > >>>>>>>>>> implementations > >>>>>>>>>>>>> necessary? (it probably makes sense to have a few smaller > >> PRs > >>>> for > >>>>>>>>>>> exposing > >>>>>>>>>>>>> this functionality anyways). > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < > >>>>>>>>>>>>> ita...@pythonspeed.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Since the PR for high-level C++ Parquet encryption API > >> appears > >>>>>>>>>> stalled > >>>>>>>>>>> ( > >>>>>>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking > >> into > >>>>>>>>>> exposing > >>>>>>>>>>>>> the > >>>>>>>>>>>>>> low-level Parquet encryption API to Python. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Arguments for doing this: the low-level API is all the users > >>> I'm > >>>>>>>>>>> talking > >>>>>>>>>>>>>> to need, at the moment, so it's plausible others would also > >>> find > >>>>>>>> some > >>>>>>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet > >>>>>>>>>> encryption. > >>>>>>>>>>>>> Then > >>>>>>>>>>>>>> again, it might only be this one company and no one else > >>> cares. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The arguments against, per Gidon Gershinsky: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> * security: low-level encryption API is easy to misuse (eg > >>>>>> giving > >>>>>>>>>> the > >>>>>>>>>>>>>> same keys for a number of different files; this'd break the > >>> AES > >>>>>> GCM > >>>>>>>>>>>>>> cipher). The high-level encryption layer handles that by > >>>> applying > >>>>>>>>>>>>> envelope > >>>>>>>>>>>>>> encryption and other best practices in data security. Also, > >>> this > >>>>>>>>>> layer > >>>>>>>>>>> is > >>>>>>>>>>>>>> maintained by the community, meaning that future > >> improvements > >>>> and > >>>>>>>>>>>>> security > >>>>>>>>>>>>>> fixes can be upstreamed by anyone, and available to all. > >>>>>>>>>>>>>>> * compatibility: parquet-mr implements the high-level > >>>> encryption > >>>>>>>>>>>>> layer. > >>>>>>>>>>>>>> If we want the files produced by Spark/Presto/etc to be > >>> readable > >>>>>> by > >>>>>>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the > >> Arrow > >>>>>> users > >>>>>>>>>>> with > >>>>>>>>>>>>>> the high-level API. > >>>>>>>>>>>>>>> ... > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The current situation is not ideal, it'd be good to merge > >> the > >>>>>>>>>>>>> high-level > >>>>>>>>>>>>>> PR (and maybe hide the low level), but here we are; also, > >> C++ > >>>> is a > >>>>>>>>>> kind > >>>>>>>>>>>>> of > >>>>>>>>>>>>>> a low-level language; Python would expose it to a less > >>>> experienced > >>>>>>>>>>>>> audience. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I find the compatibility argument less compelling, that's > >>>> readily > >>>>>>>>>>>>>> addressed by documentation. I am not a crypto expert so I > >>> can't > >>>>>>>>>>> evaluate > >>>>>>>>>>>>>> how risky exposing the low-level encryption APIs would be, > >>> but I > >>>>>> can > >>>>>>>>>>> see > >>>>>>>>>>>>>> how that would be a significant concern. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Some options are: > >>>>>>>>>>>>>> * Status quo, no Python API for low-level Parquet > >> encryption. > >>>>>> This > >>>>>>>>>> has > >>>>>>>>>>>>>> two possible outcomes: > >>>>>>>>>>>>>> * Eventually high-level API gets merged, gets Python > >>> binding. > >>>>>>>>>>>>>> * High-level encryption API is never merged, Python users > >>>> never > >>>>>>>>>> get > >>>>>>>>>>>>>> access to encryption. > >>>>>>>>>>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps > >>>> using > >>>>>>>>>>>>> "hazmat" > >>>>>>>>>>>>>> idiom used by the Python cryptography package (API namespace > >>>>>>>>>> indicating > >>>>>>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g. > >>>>>>>>>>>>>> > >>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). > >>>>>>>>>>>>>> * Gidon Gershinsky did not find this suggestion > >> compelling > >>>>>> enough > >>>>>>>>>> to > >>>>>>>>>>>>>> override his security concerns. > >>>>>>>>>>>>>> * Low-level encryption done as third party Python package, > >>>> either > >>>>>>>>>>>>> private > >>>>>>>>>>>>>> or open source. This is annoying technically, plausibly > >> would > >>>>>>>> require > >>>>>>>>>>>>>> maintaining a fork. > >>>>>>>>>>>>>> Any other ideas? Thoughts on these options? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> —Itamar > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > >