I don't think there's any concern around having a process-global shared
key cache.  The discussion was just around the implementation.

Also, FTR, a standalone LRU cache class is proposed here, which may
reduce the amount of original code in the Parquet encryption PR:
https://github.com/apache/arrow/pull/8716

Best regards

Antoine.


Le 18/02/2021 à 16:40, Gidon Gershinsky a écrit :
> I believe the shared structures that were debated are the key caches.
> 
> Cheers, Gidon
> 
> 
> On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> 
>>>
>>> I don't think any notion of threading should be present in the
>>> implementation, except for the required locks around shared structures.
>>
>>
>> I seem to recall the debate was how to model some class interactions to
>> determine what should be considered shared structures and what should not.
>>
>> On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com> wrote:
>>
>>> This certainly sounds good to me.
>>>
>>> Cheers, Gidon
>>>
>>>
>>> On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org>
>> wrote:
>>>
>>>>
>>>> I don't think any notion of threading should be present in the
>>>> implementation, except for the required locks around shared structures.
>>>>  I don't know where the idea of a "main thread" comes from, but it
>>>> probably shouldn't exist in a C++ library.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>>
>>>> Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
>>>>> Just to clarify. There are two options, which one do you refer to? A
>>>> design
>>>>> with a main thread that handles projections and the keys (relevant
>> for
>>>> the
>>>>> projected columns); or the current code with any thread allowed to
>>> handle
>>>>> full file reading, inc the footer, column projections and their keys?
>>> Can
>>>>> you finalize this with Micah?
>>>>> The good news is, Tham is still interested to resume this work, and
>> is
>>> ok
>>>>> with either option. Please let her know whether the current threading
>>>> model
>>>>> stays, or should be modified with the changes proposed in the doc
>> (for
>>>> the
>>>>> latter, some guidance with the details would be needed).
>>>>>
>>>>> Cheers, Gidon
>>>>>
>>>>>
>>>>> On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org>
>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
>>>>>>> From the doc,
>>>>>>> "To maintain consistency with the style of parquet-cpp, the above
>>>>>>> structures should not be explicitly synchronized with individual
>>>> mutexes.
>>>>>>> In the case of a parquet::arrow::FileReader, the request to read a
>>>> given
>>>>>>> selection of row groups and columns is issued from a single main
>>>> thread.
>>>>>>> Note that this does require that all keys required for a read are
>>>>>> assembled
>>>>>>> on the main thread so that DecryptionKeyRetriever objects are not
>>>>>> directly
>>>>>>> accessing any caches"
>>>>>>>
>>>>>>> The current PR code doesn't require a single main thread. Any
>> thread
>>>> can
>>>>>>> read any file, both footer and pages. So the key cache is shared,
>> to
>>>> save
>>>>>>> N-1 interactions with the KMS server.
>>>>>>
>>>>>> I don't think there's any contention on this.  IMHO the only
>> concerns
>>>>>> are about the implementation, not the semantics.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> Antoine.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers, Gidon
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <
>> anto...@python.org>
>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure a threading model is expected for an encryption
>> layer.
>>>> Am
>>>>>>>> I missing something?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>>
>>>>>>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
>>>>>>>>> Precisely, the main change is in the threading model. Afaik, the
>>>>>> document
>>>>>>>>> proposes a model that fits pandas, but might be problematic for
>>> other
>>>>>>>> users
>>>>>>>>> of this library.
>>>>>>>>> Technically, this is not showstopper though; if the community
>>> decides
>>>>>> on
>>>>>>>>> this model, it will be compatible with the high-level encryption
>>>>>> design;
>>>>>>>>> but the change implementation would need to be done by pandas
>>> experts
>>>>>>>> (not
>>>>>>>>> us; but we'll help where we can).
>>>>>>>>> Micah, you know this subject (and the community) better than we
>> do
>>> -
>>>>>> we'd
>>>>>>>>> much appreciate it if you'd take a lead on removing this
>> roadblock.
>>>>>>>>>
>>>>>>>>> Cheers, Gidon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <
>>>> emkornfi...@gmail.com
>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I think some of the comments might be conflicting.  One of the
>>>>>> concerns
>>>>>>>>>> (that I would need to refresh myself on to offer an opinion
>> which
>>>> was
>>>>>>>>>> covered in Ben's doc) was the threading model we expect in the
>>>>>> library.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <
>>> anto...@python.org>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Gidon,
>>>>>>>>>>>
>>>>>>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
>>>>>>>>>>>> Regarding the high-level layer, I think it waits for a
>> progress
>>> at
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
>>>>>>>>>>>> No activity there since last November. This is unfortunate,
>>>> because
>>>>>>>>>> Tham
>>>>>>>>>>>> has put a lot of work in coding the high-level layer (and
>>>> addressing
>>>>>>>>>> 200+
>>>>>>>>>>>> review comments) in the PR
>>>>>> https://github.com/apache/arrow/pull/8023.
>>>>>>>>>>> The
>>>>>>>>>>>> code is functional, compatible with the Java version in
>>>> parquet-mr,
>>>>>>>> and
>>>>>>>>>>> can
>>>>>>>>>>>> be updated with the threading changes in the doc above. I hope
>>> all
>>>>>>>> this
>>>>>>>>>>>> good work will not be wasted.
>>>>>>>>>>>
>>>>>>>>>>> I'm sorry for the possibly frustrating process.  Looking at the
>>> PR,
>>>>>>>>>>> though, it seems a bunch of comments were not addressed.  Is it
>>>>>>>> possible
>>>>>>>>>>> to go through them and ensure they get an answer and/or a
>>>> resolution?
>>>>>>>>>>>
>>>>>>>>>>> Best regards
>>>>>>>>>>>
>>>>>>>>>>> Antoine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers, Gidon
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
>>>>>>>> emkornfi...@gmail.com
>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> My thoughts:
>>>>>>>>>>>>> 1.  I've lost track of the higher level encryption
>>> implementation
>>>>>> in
>>>>>>>>>>> C++.
>>>>>>>>>>>>> I think we were trying to come to a consensus on the
>>>>>> threading/thread
>>>>>>>>>>>>> safety model?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2.  I'm open to exposing the lower level encryption libraries
>>> in
>>>>>>>>>> python
>>>>>>>>>>>>> (without appropriate namespacing/communication).  It seems at
>>>> least
>>>>>>>>>> for
>>>>>>>>>>>>> reading, there is potentially less harm (I'll caveat that
>> with
>>>> I'm
>>>>>>>>>> not a
>>>>>>>>>>>>> security expert).  Are both the low level read and write
>>>>>>>>>> implementations
>>>>>>>>>>>>> necessary?  (it probably makes sense to have a few smaller
>> PRs
>>>> for
>>>>>>>>>>> exposing
>>>>>>>>>>>>> this functionality anyways).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
>>>>>>>>>>>>> ita...@pythonspeed.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since the PR for high-level C++ Parquet encryption API
>> appears
>>>>>>>>>> stalled
>>>>>>>>>>> (
>>>>>>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking
>> into
>>>>>>>>>> exposing
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> low-level Parquet encryption API to Python.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Arguments for doing this: the low-level API is all the users
>>> I'm
>>>>>>>>>>> talking
>>>>>>>>>>>>>> to need, at the moment, so it's plausible others would also
>>> find
>>>>>>>> some
>>>>>>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet
>>>>>>>>>> encryption.
>>>>>>>>>>>>> Then
>>>>>>>>>>>>>> again, it might only be this one company and no one else
>>> cares.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The arguments against, per Gidon Gershinsky:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  * security: low-level encryption API is easy to misuse (eg
>>>>>> giving
>>>>>>>>>> the
>>>>>>>>>>>>>> same keys for a number of different files; this'd break the
>>> AES
>>>>>> GCM
>>>>>>>>>>>>>> cipher). The high-level encryption layer handles that by
>>>> applying
>>>>>>>>>>>>> envelope
>>>>>>>>>>>>>> encryption and other best practices in data security. Also,
>>> this
>>>>>>>>>> layer
>>>>>>>>>>> is
>>>>>>>>>>>>>> maintained by the community, meaning that future
>> improvements
>>>> and
>>>>>>>>>>>>> security
>>>>>>>>>>>>>> fixes can be upstreamed by anyone, and available to all.
>>>>>>>>>>>>>>>  * compatibility: parquet-mr implements the high-level
>>>> encryption
>>>>>>>>>>>>> layer.
>>>>>>>>>>>>>> If we want the files produced by Spark/Presto/etc to be
>>> readable
>>>>>> by
>>>>>>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the
>> Arrow
>>>>>> users
>>>>>>>>>>> with
>>>>>>>>>>>>>> the high-level API.
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The current situation is not ideal, it'd be good to merge
>> the
>>>>>>>>>>>>> high-level
>>>>>>>>>>>>>> PR (and maybe hide the low level), but here we are; also,
>> C++
>>>> is a
>>>>>>>>>> kind
>>>>>>>>>>>>> of
>>>>>>>>>>>>>> a low-level language; Python would expose it to a less
>>>> experienced
>>>>>>>>>>>>> audience.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I find the compatibility argument less compelling, that's
>>>> readily
>>>>>>>>>>>>>> addressed by documentation. I am not a crypto expert so I
>>> can't
>>>>>>>>>>> evaluate
>>>>>>>>>>>>>> how risky exposing the low-level encryption APIs would be,
>>> but I
>>>>>> can
>>>>>>>>>>> see
>>>>>>>>>>>>>> how that would be a significant concern.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some options are:
>>>>>>>>>>>>>>  * Status quo, no Python API for low-level Parquet
>> encryption.
>>>>>> This
>>>>>>>>>> has
>>>>>>>>>>>>>> two possible outcomes:
>>>>>>>>>>>>>>    * Eventually high-level API gets merged, gets Python
>>> binding.
>>>>>>>>>>>>>>    * High-level encryption API is never merged, Python users
>>>> never
>>>>>>>>>> get
>>>>>>>>>>>>>> access to encryption.
>>>>>>>>>>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps
>>>> using
>>>>>>>>>>>>> "hazmat"
>>>>>>>>>>>>>> idiom used by the Python cryptography package (API namespace
>>>>>>>>>> indicating
>>>>>>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g.
>>>>>>>>>>>>>>
>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
>>>>>>>>>>>>>>    * Gidon Gershinsky did not find this suggestion
>> compelling
>>>>>> enough
>>>>>>>>>> to
>>>>>>>>>>>>>> override his security concerns.
>>>>>>>>>>>>>>  * Low-level encryption done as third party Python package,
>>>> either
>>>>>>>>>>>>> private
>>>>>>>>>>>>>> or open source. This is annoying technically, plausibly
>> would
>>>>>>>> require
>>>>>>>>>>>>>> maintaining a fork.
>>>>>>>>>>>>>> Any other ideas? Thoughts on these options?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> —Itamar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

Reply via email to