>
> I don't think any notion of threading should be present in the
> implementation, except for the required locks around shared structures.


I seem to recall the debate was how to model some class interactions to
determine what should be considered shared structures and what should not.

On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com> wrote:

> This certainly sounds good to me.
>
> Cheers, Gidon
>
>
> On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > I don't think any notion of threading should be present in the
> > implementation, except for the required locks around shared structures.
> >  I don't know where the idea of a "main thread" comes from, but it
> > probably shouldn't exist in a C++ library.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
> > > Just to clarify. There are two options, which one do you refer to? A
> > design
> > > with a main thread that handles projections and the keys (relevant for
> > the
> > > projected columns); or the current code with any thread allowed to
> handle
> > > full file reading, inc the footer, column projections and their keys?
> Can
> > > you finalize this with Micah?
> > > The good news is, Tham is still interested to resume this work, and is
> ok
> > > with either option. Please let her know whether the current threading
> > model
> > > stays, or should be modified with the changes proposed in the doc (for
> > the
> > > latter, some guidance with the details would be needed).
> > >
> > > Cheers, Gidon
> > >
> > >
> > > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org>
> > wrote:
> > >
> > >>
> > >>
> > >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
> > >>> From the doc,
> > >>> "To maintain consistency with the style of parquet-cpp, the above
> > >>> structures should not be explicitly synchronized with individual
> > mutexes.
> > >>> In the case of a parquet::arrow::FileReader, the request to read a
> > given
> > >>> selection of row groups and columns is issued from a single main
> > thread.
> > >>> Note that this does require that all keys required for a read are
> > >> assembled
> > >>> on the main thread so that DecryptionKeyRetriever objects are not
> > >> directly
> > >>> accessing any caches"
> > >>>
> > >>> The current PR code doesn't require a single main thread. Any thread
> > can
> > >>> read any file, both footer and pages. So the key cache is shared, to
> > save
> > >>> N-1 interactions with the KMS server.
> > >>
> > >> I don't think there's any contention on this.  IMHO the only concerns
> > >> are about the implementation, not the semantics.
> > >>
> > >> Best regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >>>
> > >>> Cheers, Gidon
> > >>>
> > >>>
> > >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org>
> > >> wrote:
> > >>>
> > >>>>
> > >>>> I'm not sure a threading model is expected for an encryption layer.
> > Am
> > >>>> I missing something?
> > >>>>
> > >>>> Regards
> > >>>>
> > >>>> Antoine.
> > >>>>
> > >>>>
> > >>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> > >>>>> Precisely, the main change is in the threading model. Afaik, the
> > >> document
> > >>>>> proposes a model that fits pandas, but might be problematic for
> other
> > >>>> users
> > >>>>> of this library.
> > >>>>> Technically, this is not showstopper though; if the community
> decides
> > >> on
> > >>>>> this model, it will be compatible with the high-level encryption
> > >> design;
> > >>>>> but the change implementation would need to be done by pandas
> experts
> > >>>> (not
> > >>>>> us; but we'll help where we can).
> > >>>>> Micah, you know this subject (and the community) better than we do
> -
> > >> we'd
> > >>>>> much appreciate it if you'd take a lead on removing this roadblock.
> > >>>>>
> > >>>>> Cheers, Gidon
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <
> > emkornfi...@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> I think some of the comments might be conflicting.  One of the
> > >> concerns
> > >>>>>> (that I would need to refresh myself on to offer an opinion which
> > was
> > >>>>>> covered in Ben's doc) was the threading model we expect in the
> > >> library.
> > >>>>>>
> > >>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <
> anto...@python.org>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Hi Gidon,
> > >>>>>>>
> > >>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
> > >>>>>>>> Regarding the high-level layer, I think it waits for a progress
> at
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> > >>>>>>>> No activity there since last November. This is unfortunate,
> > because
> > >>>>>> Tham
> > >>>>>>>> has put a lot of work in coding the high-level layer (and
> > addressing
> > >>>>>> 200+
> > >>>>>>>> review comments) in the PR
> > >> https://github.com/apache/arrow/pull/8023.
> > >>>>>>> The
> > >>>>>>>> code is functional, compatible with the Java version in
> > parquet-mr,
> > >>>> and
> > >>>>>>> can
> > >>>>>>>> be updated with the threading changes in the doc above. I hope
> all
> > >>>> this
> > >>>>>>>> good work will not be wasted.
> > >>>>>>>
> > >>>>>>> I'm sorry for the possibly frustrating process.  Looking at the
> PR,
> > >>>>>>> though, it seems a bunch of comments were not addressed.  Is it
> > >>>> possible
> > >>>>>>> to go through them and ensure they get an answer and/or a
> > resolution?
> > >>>>>>>
> > >>>>>>> Best regards
> > >>>>>>>
> > >>>>>>> Antoine.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> Cheers, Gidon
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
> > >>>> emkornfi...@gmail.com
> > >>>>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> My thoughts:
> > >>>>>>>>> 1.  I've lost track of the higher level encryption
> implementation
> > >> in
> > >>>>>>> C++.
> > >>>>>>>>> I think we were trying to come to a consensus on the
> > >> threading/thread
> > >>>>>>>>> safety model?
> > >>>>>>>>>
> > >>>>>>>>> 2.  I'm open to exposing the lower level encryption libraries
> in
> > >>>>>> python
> > >>>>>>>>> (without appropriate namespacing/communication).  It seems at
> > least
> > >>>>>> for
> > >>>>>>>>> reading, there is potentially less harm (I'll caveat that with
> > I'm
> > >>>>>> not a
> > >>>>>>>>> security expert).  Are both the low level read and write
> > >>>>>> implementations
> > >>>>>>>>> necessary?  (it probably makes sense to have a few smaller PRs
> > for
> > >>>>>>> exposing
> > >>>>>>>>> this functionality anyways).
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> > >>>>>>>>> ita...@pythonspeed.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> Since the PR for high-level C++ Parquet encryption API appears
> > >>>>>> stalled
> > >>>>>>> (
> > >>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into
> > >>>>>> exposing
> > >>>>>>>>> the
> > >>>>>>>>>> low-level Parquet encryption API to Python.
> > >>>>>>>>>>
> > >>>>>>>>>> Arguments for doing this: the low-level API is all the users
> I'm
> > >>>>>>> talking
> > >>>>>>>>>> to need, at the moment, so it's plausible others would also
> find
> > >>>> some
> > >>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet
> > >>>>>> encryption.
> > >>>>>>>>> Then
> > >>>>>>>>>> again, it might only be this one company and no one else
> cares.
> > >>>>>>>>>>
> > >>>>>>>>>> The arguments against, per Gidon Gershinsky:
> > >>>>>>>>>>
> > >>>>>>>>>>>  * security: low-level encryption API is easy to misuse (eg
> > >> giving
> > >>>>>> the
> > >>>>>>>>>> same keys for a number of different files; this'd break the
> AES
> > >> GCM
> > >>>>>>>>>> cipher). The high-level encryption layer handles that by
> > applying
> > >>>>>>>>> envelope
> > >>>>>>>>>> encryption and other best practices in data security. Also,
> this
> > >>>>>> layer
> > >>>>>>> is
> > >>>>>>>>>> maintained by the community, meaning that future improvements
> > and
> > >>>>>>>>> security
> > >>>>>>>>>> fixes can be upstreamed by anyone, and available to all.
> > >>>>>>>>>>>  * compatibility: parquet-mr implements the high-level
> > encryption
> > >>>>>>>>> layer.
> > >>>>>>>>>> If we want the files produced by Spark/Presto/etc to be
> readable
> > >> by
> > >>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow
> > >> users
> > >>>>>>> with
> > >>>>>>>>>> the high-level API.
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
> > >>>>>>>>>>> The current situation is not ideal, it'd be good to merge the
> > >>>>>>>>> high-level
> > >>>>>>>>>> PR (and maybe hide the low level), but here we are; also, C++
> > is a
> > >>>>>> kind
> > >>>>>>>>> of
> > >>>>>>>>>> a low-level language; Python would expose it to a less
> > experienced
> > >>>>>>>>> audience.
> > >>>>>>>>>>
> > >>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> > >>>>>>>>>>
> > >>>>>>>>>> I find the compatibility argument less compelling, that's
> > readily
> > >>>>>>>>>> addressed by documentation. I am not a crypto expert so I
> can't
> > >>>>>>> evaluate
> > >>>>>>>>>> how risky exposing the low-level encryption APIs would be,
> but I
> > >> can
> > >>>>>>> see
> > >>>>>>>>>> how that would be a significant concern.
> > >>>>>>>>>>
> > >>>>>>>>>> Some options are:
> > >>>>>>>>>>  * Status quo, no Python API for low-level Parquet encryption.
> > >> This
> > >>>>>> has
> > >>>>>>>>>> two possible outcomes:
> > >>>>>>>>>>    * Eventually high-level API gets merged, gets Python
> binding.
> > >>>>>>>>>>    * High-level encryption API is never merged, Python users
> > never
> > >>>>>> get
> > >>>>>>>>>> access to encryption.
> > >>>>>>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps
> > using
> > >>>>>>>>> "hazmat"
> > >>>>>>>>>> idiom used by the Python cryptography package (API namespace
> > >>>>>> indicating
> > >>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g.
> > >>>>>>>>>>
> > `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
> > >>>>>>>>>>    * Gidon Gershinsky did not find this suggestion compelling
> > >> enough
> > >>>>>> to
> > >>>>>>>>>> override his security concerns.
> > >>>>>>>>>>  * Low-level encryption done as third party Python package,
> > either
> > >>>>>>>>> private
> > >>>>>>>>>> or open source. This is annoying technically, plausibly would
> > >>>> require
> > >>>>>>>>>> maintaining a fork.
> > >>>>>>>>>> Any other ideas? Thoughts on these options?
> > >>>>>>>>>>
> > >>>>>>>>>> —Itamar
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >
>

Reply via email to