> > I don't think any notion of threading should be present in the > implementation, except for the required locks around shared structures.
I seem to recall the debate was how to model some class interactions to determine what should be considered shared structures and what should not. On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com> wrote: > This certainly sounds good to me. > > Cheers, Gidon > > > On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org> wrote: > > > > > I don't think any notion of threading should be present in the > > implementation, except for the required locks around shared structures. > > I don't know where the idea of a "main thread" comes from, but it > > probably shouldn't exist in a C++ library. > > > > Regards > > > > Antoine. > > > > > > > > Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit : > > > Just to clarify. There are two options, which one do you refer to? A > > design > > > with a main thread that handles projections and the keys (relevant for > > the > > > projected columns); or the current code with any thread allowed to > handle > > > full file reading, inc the footer, column projections and their keys? > Can > > > you finalize this with Micah? > > > The good news is, Tham is still interested to resume this work, and is > ok > > > with either option. Please let her know whether the current threading > > model > > > stays, or should be modified with the changes proposed in the doc (for > > the > > > latter, some guidance with the details would be needed). > > > > > > Cheers, Gidon > > > > > > > > > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> > > wrote: > > > > > >> > > >> > > >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit : > > >>> From the doc, > > >>> "To maintain consistency with the style of parquet-cpp, the above > > >>> structures should not be explicitly synchronized with individual > > mutexes. > > >>> In the case of a parquet::arrow::FileReader, the request to read a > > given > > >>> selection of row groups and columns is issued from a single main > > thread. > > >>> Note that this does require that all keys required for a read are > > >> assembled > > >>> on the main thread so that DecryptionKeyRetriever objects are not > > >> directly > > >>> accessing any caches" > > >>> > > >>> The current PR code doesn't require a single main thread. Any thread > > can > > >>> read any file, both footer and pages. So the key cache is shared, to > > save > > >>> N-1 interactions with the KMS server. > > >> > > >> I don't think there's any contention on this. IMHO the only concerns > > >> are about the implementation, not the semantics. > > >> > > >> Best regards > > >> > > >> Antoine. > > >> > > >> > > >>> > > >>> Cheers, Gidon > > >>> > > >>> > > >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> > > >> wrote: > > >>> > > >>>> > > >>>> I'm not sure a threading model is expected for an encryption layer. > > Am > > >>>> I missing something? > > >>>> > > >>>> Regards > > >>>> > > >>>> Antoine. > > >>>> > > >>>> > > >>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : > > >>>>> Precisely, the main change is in the threading model. Afaik, the > > >> document > > >>>>> proposes a model that fits pandas, but might be problematic for > other > > >>>> users > > >>>>> of this library. > > >>>>> Technically, this is not showstopper though; if the community > decides > > >> on > > >>>>> this model, it will be compatible with the high-level encryption > > >> design; > > >>>>> but the change implementation would need to be done by pandas > experts > > >>>> (not > > >>>>> us; but we'll help where we can). > > >>>>> Micah, you know this subject (and the community) better than we do > - > > >> we'd > > >>>>> much appreciate it if you'd take a lead on removing this roadblock. > > >>>>> > > >>>>> Cheers, Gidon > > >>>>> > > >>>>> > > >>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield < > > emkornfi...@gmail.com > > >>> > > >>>>> wrote: > > >>>>> > > >>>>>> I think some of the comments might be conflicting. One of the > > >> concerns > > >>>>>> (that I would need to refresh myself on to offer an opinion which > > was > > >>>>>> covered in Ben's doc) was the threading model we expect in the > > >> library. > > >>>>>> > > >>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou < > anto...@python.org> > > >>>> wrote: > > >>>>>> > > >>>>>>> > > >>>>>>> Hi Gidon, > > >>>>>>> > > >>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > > >>>>>>>> Regarding the high-level layer, I think it waits for a progress > at > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>> > > >> > > > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > > >>>>>>>> No activity there since last November. This is unfortunate, > > because > > >>>>>> Tham > > >>>>>>>> has put a lot of work in coding the high-level layer (and > > addressing > > >>>>>> 200+ > > >>>>>>>> review comments) in the PR > > >> https://github.com/apache/arrow/pull/8023. > > >>>>>>> The > > >>>>>>>> code is functional, compatible with the Java version in > > parquet-mr, > > >>>> and > > >>>>>>> can > > >>>>>>>> be updated with the threading changes in the doc above. I hope > all > > >>>> this > > >>>>>>>> good work will not be wasted. > > >>>>>>> > > >>>>>>> I'm sorry for the possibly frustrating process. Looking at the > PR, > > >>>>>>> though, it seems a bunch of comments were not addressed. Is it > > >>>> possible > > >>>>>>> to go through them and ensure they get an answer and/or a > > resolution? > > >>>>>>> > > >>>>>>> Best regards > > >>>>>>> > > >>>>>>> Antoine. > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>>> > > >>>>>>>> Cheers, Gidon > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < > > >>>> emkornfi...@gmail.com > > >>>>>>> > > >>>>>>>> wrote: > > >>>>>>>> > > >>>>>>>>> My thoughts: > > >>>>>>>>> 1. I've lost track of the higher level encryption > implementation > > >> in > > >>>>>>> C++. > > >>>>>>>>> I think we were trying to come to a consensus on the > > >> threading/thread > > >>>>>>>>> safety model? > > >>>>>>>>> > > >>>>>>>>> 2. I'm open to exposing the lower level encryption libraries > in > > >>>>>> python > > >>>>>>>>> (without appropriate namespacing/communication). It seems at > > least > > >>>>>> for > > >>>>>>>>> reading, there is potentially less harm (I'll caveat that with > > I'm > > >>>>>> not a > > >>>>>>>>> security expert). Are both the low level read and write > > >>>>>> implementations > > >>>>>>>>> necessary? (it probably makes sense to have a few smaller PRs > > for > > >>>>>>> exposing > > >>>>>>>>> this functionality anyways). > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < > > >>>>>>>>> ita...@pythonspeed.com> wrote: > > >>>>>>>>> > > >>>>>>>>>> Hi, > > >>>>>>>>>> > > >>>>>>>>>> Since the PR for high-level C++ Parquet encryption API appears > > >>>>>> stalled > > >>>>>>> ( > > >>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into > > >>>>>> exposing > > >>>>>>>>> the > > >>>>>>>>>> low-level Parquet encryption API to Python. > > >>>>>>>>>> > > >>>>>>>>>> Arguments for doing this: the low-level API is all the users > I'm > > >>>>>>> talking > > >>>>>>>>>> to need, at the moment, so it's plausible others would also > find > > >>>> some > > >>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet > > >>>>>> encryption. > > >>>>>>>>> Then > > >>>>>>>>>> again, it might only be this one company and no one else > cares. > > >>>>>>>>>> > > >>>>>>>>>> The arguments against, per Gidon Gershinsky: > > >>>>>>>>>> > > >>>>>>>>>>> * security: low-level encryption API is easy to misuse (eg > > >> giving > > >>>>>> the > > >>>>>>>>>> same keys for a number of different files; this'd break the > AES > > >> GCM > > >>>>>>>>>> cipher). The high-level encryption layer handles that by > > applying > > >>>>>>>>> envelope > > >>>>>>>>>> encryption and other best practices in data security. Also, > this > > >>>>>> layer > > >>>>>>> is > > >>>>>>>>>> maintained by the community, meaning that future improvements > > and > > >>>>>>>>> security > > >>>>>>>>>> fixes can be upstreamed by anyone, and available to all. > > >>>>>>>>>>> * compatibility: parquet-mr implements the high-level > > encryption > > >>>>>>>>> layer. > > >>>>>>>>>> If we want the files produced by Spark/Presto/etc to be > readable > > >> by > > >>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow > > >> users > > >>>>>>> with > > >>>>>>>>>> the high-level API. > > >>>>>>>>>>> ... > > >>>>>>>>>>> > > >>>>>>>>>>> The current situation is not ideal, it'd be good to merge the > > >>>>>>>>> high-level > > >>>>>>>>>> PR (and maybe hide the low level), but here we are; also, C++ > > is a > > >>>>>> kind > > >>>>>>>>> of > > >>>>>>>>>> a low-level language; Python would expose it to a less > > experienced > > >>>>>>>>> audience. > > >>>>>>>>>> > > >>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) > > >>>>>>>>>> > > >>>>>>>>>> I find the compatibility argument less compelling, that's > > readily > > >>>>>>>>>> addressed by documentation. I am not a crypto expert so I > can't > > >>>>>>> evaluate > > >>>>>>>>>> how risky exposing the low-level encryption APIs would be, > but I > > >> can > > >>>>>>> see > > >>>>>>>>>> how that would be a significant concern. > > >>>>>>>>>> > > >>>>>>>>>> Some options are: > > >>>>>>>>>> * Status quo, no Python API for low-level Parquet encryption. > > >> This > > >>>>>> has > > >>>>>>>>>> two possible outcomes: > > >>>>>>>>>> * Eventually high-level API gets merged, gets Python > binding. > > >>>>>>>>>> * High-level encryption API is never merged, Python users > > never > > >>>>>> get > > >>>>>>>>>> access to encryption. > > >>>>>>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps > > using > > >>>>>>>>> "hazmat" > > >>>>>>>>>> idiom used by the Python cryptography package (API namespace > > >>>>>> indicating > > >>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g. > > >>>>>>>>>> > > `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). > > >>>>>>>>>> * Gidon Gershinsky did not find this suggestion compelling > > >> enough > > >>>>>> to > > >>>>>>>>>> override his security concerns. > > >>>>>>>>>> * Low-level encryption done as third party Python package, > > either > > >>>>>>>>> private > > >>>>>>>>>> or open source. This is annoying technically, plausibly would > > >>>> require > > >>>>>>>>>> maintaining a fork. > > >>>>>>>>>> Any other ideas? Thoughts on these options? > > >>>>>>>>>> > > >>>>>>>>>> —Itamar > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > > > >