Hi Gidon,
I'm currently looking at this.
Regards
Antoine.
Le 10/03/2021 à 08:58, Gidon Gershinsky a écrit :
Hi Antoine,
All comments have been handled. Can we ask you to shepherd this PR for the
reminder of its lifecycle? (hopefully, most of this is already behind us).
https://github.com/apache/arrow/pull/8023
Cheers, Gidon
---------- Forwarded message ---------
From: Gidon Gershinsky <gg5...@gmail.com>
Date: Thu, Feb 18, 2021 at 6:25 PM
Subject: Re: Exposing low-level Parquet encryption to Python user (or,
maybe not)
To: dev <dev@arrow.apache.org>
Thanks, then we'll just go ahead and address the remaining comments.
Cheers, Gidon
On Thu, Feb 18, 2021 at 5:45 PM Antoine Pitrou <anto...@python.org> wrote:
I don't think there's any concern around having a process-global shared
key cache. The discussion was just around the implementation.
Also, FTR, a standalone LRU cache class is proposed here, which may
reduce the amount of original code in the Parquet encryption PR:
https://github.com/apache/arrow/pull/8716
Best regards
Antoine.
Le 18/02/2021 à 16:40, Gidon Gershinsky a écrit :
I believe the shared structures that were debated are the key caches.
Cheers, Gidon
On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:
I don't think any notion of threading should be present in the
implementation, except for the required locks around shared structures.
I seem to recall the debate was how to model some class interactions to
determine what should be considered shared structures and what should
not.
On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com>
wrote:
This certainly sounds good to me.
Cheers, Gidon
On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org>
wrote:
I don't think any notion of threading should be present in the
implementation, except for the required locks around shared
structures.
I don't know where the idea of a "main thread" comes from, but it
probably shouldn't exist in a C++ library.
Regards
Antoine.
Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
Just to clarify. There are two options, which one do you refer to? A
design
with a main thread that handles projections and the keys (relevant
for
the
projected columns); or the current code with any thread allowed to
handle
full file reading, inc the footer, column projections and their keys?
Can
you finalize this with Micah?
The good news is, Tham is still interested to resume this work, and
is
ok
with either option. Please let her know whether the current threading
model
stays, or should be modified with the changes proposed in the doc
(for
the
latter, some guidance with the details would be needed).
Cheers, Gidon
On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org>
wrote:
Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
From the doc,
"To maintain consistency with the style of parquet-cpp, the above
structures should not be explicitly synchronized with individual
mutexes.
In the case of a parquet::arrow::FileReader, the request to read a
given
selection of row groups and columns is issued from a single main
thread.
Note that this does require that all keys required for a read are
assembled
on the main thread so that DecryptionKeyRetriever objects are not
directly
accessing any caches"
The current PR code doesn't require a single main thread. Any
thread
can
read any file, both footer and pages. So the key cache is shared,
to
save
N-1 interactions with the KMS server.
I don't think there's any contention on this. IMHO the only
concerns
are about the implementation, not the semantics.
Best regards
Antoine.
Cheers, Gidon
On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <
anto...@python.org>
wrote:
I'm not sure a threading model is expected for an encryption
layer.
Am
I missing something?
Regards
Antoine.
Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
Precisely, the main change is in the threading model. Afaik, the
document
proposes a model that fits pandas, but might be problematic for
other
users
of this library.
Technically, this is not showstopper though; if the community
decides
on
this model, it will be compatible with the high-level encryption
design;
but the change implementation would need to be done by pandas
experts
(not
us; but we'll help where we can).
Micah, you know this subject (and the community) better than we
do
-
we'd
much appreciate it if you'd take a lead on removing this
roadblock.
Cheers, Gidon
On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <
emkornfi...@gmail.com
wrote:
I think some of the comments might be conflicting. One of the
concerns
(that I would need to refresh myself on to offer an opinion
which
was
covered in Ben's doc) was the threading model we expect in the
library.
On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <
anto...@python.org>
wrote:
Hi Gidon,
Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
Regarding the high-level layer, I think it waits for a
progress
at
https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
No activity there since last November. This is unfortunate,
because
Tham
has put a lot of work in coding the high-level layer (and
addressing
200+
review comments) in the PR
https://github.com/apache/arrow/pull/8023.
The
code is functional, compatible with the Java version in
parquet-mr,
and
can
be updated with the threading changes in the doc above. I hope
all
this
good work will not be wasted.
I'm sorry for the possibly frustrating process. Looking at the
PR,
though, it seems a bunch of comments were not addressed. Is it
possible
to go through them and ensure they get an answer and/or a
resolution?
Best regards
Antoine.
Cheers, Gidon
On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
emkornfi...@gmail.com
wrote:
My thoughts:
1. I've lost track of the higher level encryption
implementation
in
C++.
I think we were trying to come to a consensus on the
threading/thread
safety model?
2. I'm open to exposing the lower level encryption libraries
in
python
(without appropriate namespacing/communication). It seems at
least
for
reading, there is potentially less harm (I'll caveat that
with
I'm
not a
security expert). Are both the low level read and write
implementations
necessary? (it probably makes sense to have a few smaller
PRs
for
exposing
this functionality anyways).
On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
ita...@pythonspeed.com> wrote:
Hi,
Since the PR for high-level C++ Parquet encryption API
appears
stalled
(
https://github.com/apache/arrow/pull/8023), I'm looking
into
exposing
the
low-level Parquet encryption API to Python.
Arguments for doing this: the low-level API is all the users
I'm
talking
to need, at the moment, so it's plausible others would also
find
some
benefit in having the Pyarrow API expose low-level Parquet
encryption.
Then
again, it might only be this one company and no one else
cares.
The arguments against, per Gidon Gershinsky:
* security: low-level encryption API is easy to misuse (eg
giving
the
same keys for a number of different files; this'd break the
AES
GCM
cipher). The high-level encryption layer handles that by
applying
envelope
encryption and other best practices in data security. Also,
this
layer
is
maintained by the community, meaning that future
improvements
and
security
fixes can be upstreamed by anyone, and available to all.
* compatibility: parquet-mr implements the high-level
encryption
layer.
If we want the files produced by Spark/Presto/etc to be
readable
by
pandas/PyArrow (and vice versa), we need to provide the
Arrow
users
with
the high-level API.
...
The current situation is not ideal, it'd be good to merge
the
high-level
PR (and maybe hide the low level), but here we are; also,
C++
is a
kind
of
a low-level language; Python would expose it to a less
experienced
audience.
(Source: https://issues.apache.org/jira/browse/ARROW-8040)
I find the compatibility argument less compelling, that's
readily
addressed by documentation. I am not a crypto expert so I
can't
evaluate
how risky exposing the low-level encryption APIs would be,
but I
can
see
how that would be a significant concern.
Some options are:
* Status quo, no Python API for low-level Parquet
encryption.
This
has
two possible outcomes:
* Eventually high-level API gets merged, gets Python
binding.
* High-level encryption API is never merged, Python users
never
get
access to encryption.
* Add low-level Parquet encryption API to Pyarrow, perhaps
using
"hazmat"
idiom used by the Python cryptography package (API namespace
indicating
"use at your own risk, this is dangerous", basically, e.g.
`cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
* Gidon Gershinsky did not find this suggestion
compelling
enough
to
override his security concerns.
* Low-level encryption done as third party Python package,
either
private
or open source. This is annoying technically, plausibly
would
require
maintaining a fork.
Any other ideas? Thoughts on these options?
—Itamar