I don't think there's any concern around having a process-global shared key cache. The discussion was just around the implementation.
Also, FTR, a standalone LRU cache class is proposed here, which may reduce the amount of original code in the Parquet encryption PR: https://github.com/apache/arrow/pull/8716 Best regards Antoine. Le 18/02/2021 à 16:40, Gidon Gershinsky a écrit : > I believe the shared structures that were debated are the key caches. > > Cheers, Gidon > > > On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >>> >>> I don't think any notion of threading should be present in the >>> implementation, except for the required locks around shared structures. >> >> >> I seem to recall the debate was how to model some class interactions to >> determine what should be considered shared structures and what should not. >> >> On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com> wrote: >> >>> This certainly sounds good to me. >>> >>> Cheers, Gidon >>> >>> >>> On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org> >> wrote: >>> >>>> >>>> I don't think any notion of threading should be present in the >>>> implementation, except for the required locks around shared structures. >>>> I don't know where the idea of a "main thread" comes from, but it >>>> probably shouldn't exist in a C++ library. >>>> >>>> Regards >>>> >>>> Antoine. >>>> >>>> >>>> >>>> Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit : >>>>> Just to clarify. There are two options, which one do you refer to? A >>>> design >>>>> with a main thread that handles projections and the keys (relevant >> for >>>> the >>>>> projected columns); or the current code with any thread allowed to >>> handle >>>>> full file reading, inc the footer, column projections and their keys? >>> Can >>>>> you finalize this with Micah? >>>>> The good news is, Tham is still interested to resume this work, and >> is >>> ok >>>>> with either option. Please let her know whether the current threading >>>> model >>>>> stays, or should be modified with the changes proposed in the doc >> (for >>>> the >>>>> latter, some guidance with the details would be needed). >>>>> >>>>> Cheers, Gidon >>>>> >>>>> >>>>> On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> >>>> wrote: >>>>> >>>>>> >>>>>> >>>>>> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit : >>>>>>> From the doc, >>>>>>> "To maintain consistency with the style of parquet-cpp, the above >>>>>>> structures should not be explicitly synchronized with individual >>>> mutexes. >>>>>>> In the case of a parquet::arrow::FileReader, the request to read a >>>> given >>>>>>> selection of row groups and columns is issued from a single main >>>> thread. >>>>>>> Note that this does require that all keys required for a read are >>>>>> assembled >>>>>>> on the main thread so that DecryptionKeyRetriever objects are not >>>>>> directly >>>>>>> accessing any caches" >>>>>>> >>>>>>> The current PR code doesn't require a single main thread. Any >> thread >>>> can >>>>>>> read any file, both footer and pages. So the key cache is shared, >> to >>>> save >>>>>>> N-1 interactions with the KMS server. >>>>>> >>>>>> I don't think there's any contention on this. IMHO the only >> concerns >>>>>> are about the implementation, not the semantics. >>>>>> >>>>>> Best regards >>>>>> >>>>>> Antoine. >>>>>> >>>>>> >>>>>>> >>>>>>> Cheers, Gidon >>>>>>> >>>>>>> >>>>>>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou < >> anto...@python.org> >>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> I'm not sure a threading model is expected for an encryption >> layer. >>>> Am >>>>>>>> I missing something? >>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> Antoine. >>>>>>>> >>>>>>>> >>>>>>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : >>>>>>>>> Precisely, the main change is in the threading model. Afaik, the >>>>>> document >>>>>>>>> proposes a model that fits pandas, but might be problematic for >>> other >>>>>>>> users >>>>>>>>> of this library. >>>>>>>>> Technically, this is not showstopper though; if the community >>> decides >>>>>> on >>>>>>>>> this model, it will be compatible with the high-level encryption >>>>>> design; >>>>>>>>> but the change implementation would need to be done by pandas >>> experts >>>>>>>> (not >>>>>>>>> us; but we'll help where we can). >>>>>>>>> Micah, you know this subject (and the community) better than we >> do >>> - >>>>>> we'd >>>>>>>>> much appreciate it if you'd take a lead on removing this >> roadblock. >>>>>>>>> >>>>>>>>> Cheers, Gidon >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield < >>>> emkornfi...@gmail.com >>>>>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> I think some of the comments might be conflicting. One of the >>>>>> concerns >>>>>>>>>> (that I would need to refresh myself on to offer an opinion >> which >>>> was >>>>>>>>>> covered in Ben's doc) was the threading model we expect in the >>>>>> library. >>>>>>>>>> >>>>>>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou < >>> anto...@python.org> >>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi Gidon, >>>>>>>>>>> >>>>>>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : >>>>>>>>>>>> Regarding the high-level layer, I think it waits for a >> progress >>> at >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>> >>> >> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing >>>>>>>>>>>> No activity there since last November. This is unfortunate, >>>> because >>>>>>>>>> Tham >>>>>>>>>>>> has put a lot of work in coding the high-level layer (and >>>> addressing >>>>>>>>>> 200+ >>>>>>>>>>>> review comments) in the PR >>>>>> https://github.com/apache/arrow/pull/8023. >>>>>>>>>>> The >>>>>>>>>>>> code is functional, compatible with the Java version in >>>> parquet-mr, >>>>>>>> and >>>>>>>>>>> can >>>>>>>>>>>> be updated with the threading changes in the doc above. I hope >>> all >>>>>>>> this >>>>>>>>>>>> good work will not be wasted. >>>>>>>>>>> >>>>>>>>>>> I'm sorry for the possibly frustrating process. Looking at the >>> PR, >>>>>>>>>>> though, it seems a bunch of comments were not addressed. Is it >>>>>>>> possible >>>>>>>>>>> to go through them and ensure they get an answer and/or a >>>> resolution? >>>>>>>>>>> >>>>>>>>>>> Best regards >>>>>>>>>>> >>>>>>>>>>> Antoine. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Cheers, Gidon >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < >>>>>>>> emkornfi...@gmail.com >>>>>>>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> My thoughts: >>>>>>>>>>>>> 1. I've lost track of the higher level encryption >>> implementation >>>>>> in >>>>>>>>>>> C++. >>>>>>>>>>>>> I think we were trying to come to a consensus on the >>>>>> threading/thread >>>>>>>>>>>>> safety model? >>>>>>>>>>>>> >>>>>>>>>>>>> 2. I'm open to exposing the lower level encryption libraries >>> in >>>>>>>>>> python >>>>>>>>>>>>> (without appropriate namespacing/communication). It seems at >>>> least >>>>>>>>>> for >>>>>>>>>>>>> reading, there is potentially less harm (I'll caveat that >> with >>>> I'm >>>>>>>>>> not a >>>>>>>>>>>>> security expert). Are both the low level read and write >>>>>>>>>> implementations >>>>>>>>>>>>> necessary? (it probably makes sense to have a few smaller >> PRs >>>> for >>>>>>>>>>> exposing >>>>>>>>>>>>> this functionality anyways). >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < >>>>>>>>>>>>> ita...@pythonspeed.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Since the PR for high-level C++ Parquet encryption API >> appears >>>>>>>>>> stalled >>>>>>>>>>> ( >>>>>>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking >> into >>>>>>>>>> exposing >>>>>>>>>>>>> the >>>>>>>>>>>>>> low-level Parquet encryption API to Python. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Arguments for doing this: the low-level API is all the users >>> I'm >>>>>>>>>>> talking >>>>>>>>>>>>>> to need, at the moment, so it's plausible others would also >>> find >>>>>>>> some >>>>>>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet >>>>>>>>>> encryption. >>>>>>>>>>>>> Then >>>>>>>>>>>>>> again, it might only be this one company and no one else >>> cares. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The arguments against, per Gidon Gershinsky: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> * security: low-level encryption API is easy to misuse (eg >>>>>> giving >>>>>>>>>> the >>>>>>>>>>>>>> same keys for a number of different files; this'd break the >>> AES >>>>>> GCM >>>>>>>>>>>>>> cipher). The high-level encryption layer handles that by >>>> applying >>>>>>>>>>>>> envelope >>>>>>>>>>>>>> encryption and other best practices in data security. Also, >>> this >>>>>>>>>> layer >>>>>>>>>>> is >>>>>>>>>>>>>> maintained by the community, meaning that future >> improvements >>>> and >>>>>>>>>>>>> security >>>>>>>>>>>>>> fixes can be upstreamed by anyone, and available to all. >>>>>>>>>>>>>>> * compatibility: parquet-mr implements the high-level >>>> encryption >>>>>>>>>>>>> layer. >>>>>>>>>>>>>> If we want the files produced by Spark/Presto/etc to be >>> readable >>>>>> by >>>>>>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the >> Arrow >>>>>> users >>>>>>>>>>> with >>>>>>>>>>>>>> the high-level API. >>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The current situation is not ideal, it'd be good to merge >> the >>>>>>>>>>>>> high-level >>>>>>>>>>>>>> PR (and maybe hide the low level), but here we are; also, >> C++ >>>> is a >>>>>>>>>> kind >>>>>>>>>>>>> of >>>>>>>>>>>>>> a low-level language; Python would expose it to a less >>>> experienced >>>>>>>>>>>>> audience. >>>>>>>>>>>>>> >>>>>>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) >>>>>>>>>>>>>> >>>>>>>>>>>>>> I find the compatibility argument less compelling, that's >>>> readily >>>>>>>>>>>>>> addressed by documentation. I am not a crypto expert so I >>> can't >>>>>>>>>>> evaluate >>>>>>>>>>>>>> how risky exposing the low-level encryption APIs would be, >>> but I >>>>>> can >>>>>>>>>>> see >>>>>>>>>>>>>> how that would be a significant concern. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Some options are: >>>>>>>>>>>>>> * Status quo, no Python API for low-level Parquet >> encryption. >>>>>> This >>>>>>>>>> has >>>>>>>>>>>>>> two possible outcomes: >>>>>>>>>>>>>> * Eventually high-level API gets merged, gets Python >>> binding. >>>>>>>>>>>>>> * High-level encryption API is never merged, Python users >>>> never >>>>>>>>>> get >>>>>>>>>>>>>> access to encryption. >>>>>>>>>>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps >>>> using >>>>>>>>>>>>> "hazmat" >>>>>>>>>>>>>> idiom used by the Python cryptography package (API namespace >>>>>>>>>> indicating >>>>>>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g. >>>>>>>>>>>>>> >>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). >>>>>>>>>>>>>> * Gidon Gershinsky did not find this suggestion >> compelling >>>>>> enough >>>>>>>>>> to >>>>>>>>>>>>>> override his security concerns. >>>>>>>>>>>>>> * Low-level encryption done as third party Python package, >>>> either >>>>>>>>>>>>> private >>>>>>>>>>>>>> or open source. This is annoying technically, plausibly >> would >>>>>>>> require >>>>>>>>>>>>>> maintaining a fork. >>>>>>>>>>>>>> Any other ideas? Thoughts on these options? >>>>>>>>>>>>>> >>>>>>>>>>>>>> —Itamar >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >