Re: Fwd: Exposing low-level Parquet encryption to Python user (or, maybe not)

Antoine Pitrou Wed, 10 Mar 2021 09:37:35 -0800


Hi Gidon,

I'm currently looking at this.

Regards

Antoine.


Le 10/03/2021 à 08:58, Gidon Gershinsky a écrit :

Hi Antoine,

All comments have been handled. Can we ask you to shepherd this PR for the
reminder of its lifecycle? (hopefully, most of this is already behind us).
https://github.com/apache/arrow/pull/8023


Cheers, Gidon


---------- Forwarded message ---------
From: Gidon Gershinsky <gg5...@gmail.com>
Date: Thu, Feb 18, 2021 at 6:25 PM
Subject: Re: Exposing low-level Parquet encryption to Python user (or,
maybe not)
To: dev <dev@arrow.apache.org>


Thanks, then we'll just go ahead and address the remaining comments.

Cheers, Gidon


On Thu, Feb 18, 2021 at 5:45 PM Antoine Pitrou <anto...@python.org> wrote:


I don't think there's any concern around having a process-global shared
key cache.  The discussion was just around the implementation.

Also, FTR, a standalone LRU cache class is proposed here, which may
reduce the amount of original code in the Parquet encryption PR:
https://github.com/apache/arrow/pull/8716

Best regards

Antoine.


Le 18/02/2021 à 16:40, Gidon Gershinsky a écrit :

I believe the shared structures that were debated are the key caches.

Cheers, Gidon


On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:


I don't think any notion of threading should be present in the
implementation, except for the required locks around shared structures.



I seem to recall the debate was how to model some class interactions to
determine what should be considered shared structures and what should

not.


On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com>

wrote:

This certainly sounds good to me.

Cheers, Gidon


On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org>

wrote:


I don't think any notion of threading should be present in the
implementation, except for the required locks around shared

structures.

  I don't know where the idea of a "main thread" comes from, but it
probably shouldn't exist in a C++ library.

Regards

Antoine.



Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :

Just to clarify. There are two options, which one do you refer to? A

design

with a main thread that handles projections and the keys (relevant

for

the

projected columns); or the current code with any thread allowed to

handle

full file reading, inc the footer, column projections and their keys?

Can

you finalize this with Micah?
The good news is, Tham is still interested to resume this work, and

is

ok

with either option. Please let her know whether the current threading

model

stays, or should be modified with the changes proposed in the doc

(for

the

latter, some guidance with the details would be needed).

Cheers, Gidon


On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org>

wrote:



Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :

 From the doc,
"To maintain consistency with the style of parquet-cpp, the above
structures should not be explicitly synchronized with individual

mutexes.

In the case of a parquet::arrow::FileReader, the request to read a

given

selection of row groups and columns is issued from a single main

thread.

Note that this does require that all keys required for a read are

assembled

on the main thread so that DecryptionKeyRetriever objects are not

directly

accessing any caches"

The current PR code doesn't require a single main thread. Any

thread

can

read any file, both footer and pages. So the key cache is shared,

to

save

N-1 interactions with the KMS server.


I don't think there's any contention on this.  IMHO the only

concerns

are about the implementation, not the semantics.

Best regards

Antoine.


Cheers, Gidon


On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <

anto...@python.org>

wrote:


I'm not sure a threading model is expected for an encryption

layer.

Am

I missing something?

Regards

Antoine.


Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :

Precisely, the main change is in the threading model. Afaik, the

document

proposes a model that fits pandas, but might be problematic for

other

users

of this library.
Technically, this is not showstopper though; if the community

decides

on

this model, it will be compatible with the high-level encryption

design;

but the change implementation would need to be done by pandas

experts

(not

us; but we'll help where we can).
Micah, you know this subject (and the community) better than we

do

we'd

much appreciate it if you'd take a lead on removing this

roadblock.


Cheers, Gidon


On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <

emkornfi...@gmail.com

wrote:

I think some of the comments might be conflicting.  One of the

concerns

(that I would need to refresh myself on to offer an opinion

which

was

covered in Ben's doc) was the threading model we expect in the

library.


On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <

anto...@python.org>

wrote:


Hi Gidon,

Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :

Regarding the high-level layer, I think it waits for a

progress

at

https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing

No activity there since last November. This is unfortunate,

because

Tham

has put a lot of work in coding the high-level layer (and

addressing

200+

review comments) in the PR

https://github.com/apache/arrow/pull/8023.

The

code is functional, compatible with the Java version in

parquet-mr,

and

can

be updated with the threading changes in the doc above. I hope

all

this

good work will not be wasted.


I'm sorry for the possibly frustrating process.  Looking at the

PR,

though, it seems a bunch of comments were not addressed.  Is it

possible

to go through them and ensure they get an answer and/or a

resolution?


Best regards

Antoine.


Cheers, Gidon


On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <

emkornfi...@gmail.com

wrote:

My thoughts:
1.  I've lost track of the higher level encryption

implementation

in

C++.

I think we were trying to come to a consensus on the

threading/thread

safety model?

2.  I'm open to exposing the lower level encryption libraries

in

python

(without appropriate namespacing/communication).  It seems at

least

for

reading, there is potentially less harm (I'll caveat that

with

I'm

not a

security expert).  Are both the low level read and write

implementations

necessary?  (it probably makes sense to have a few smaller

PRs

for

exposing

this functionality anyways).



On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
ita...@pythonspeed.com> wrote:

Hi,

Since the PR for high-level C++ Parquet encryption API

appears

stalled

https://github.com/apache/arrow/pull/8023), I'm looking

into

exposing

the

low-level Parquet encryption API to Python.

Arguments for doing this: the low-level API is all the users

I'm

talking

to need, at the moment, so it's plausible others would also

find

some

benefit in having the Pyarrow API expose low-level Parquet

encryption.

Then

again, it might only be this one company and no one else

cares.


The arguments against, per Gidon Gershinsky:

  * security: low-level encryption API is easy to misuse (eg

giving

the

same keys for a number of different files; this'd break the

AES

GCM

cipher). The high-level encryption layer handles that by

applying

envelope

encryption and other best practices in data security. Also,

this

layer

is

maintained by the community, meaning that future

improvements

and

security

fixes can be upstreamed by anyone, and available to all.

  * compatibility: parquet-mr implements the high-level

encryption

layer.

If we want the files produced by Spark/Presto/etc to be

readable

by

pandas/PyArrow (and vice versa), we need to provide the

Arrow

users

with

the high-level API.

...

The current situation is not ideal, it'd be good to merge

the

high-level

PR (and maybe hide the low level), but here we are; also,

C++

is a

kind

of

a low-level language; Python would expose it to a less

experienced

audience.


(Source: https://issues.apache.org/jira/browse/ARROW-8040)

I find the compatibility argument less compelling, that's

readily

addressed by documentation. I am not a crypto expert so I

can't

evaluate

how risky exposing the low-level encryption APIs would be,

but I

can

see

how that would be a significant concern.

Some options are:
  * Status quo, no Python API for low-level Parquet

encryption.

This

has

two possible outcomes:
    * Eventually high-level API gets merged, gets Python

binding.

    * High-level encryption API is never merged, Python users

never

get

access to encryption.
  * Add low-level Parquet encryption API to Pyarrow, perhaps

using

"hazmat"

idiom used by the Python cryptography package (API namespace

indicating

"use at your own risk, this is dangerous", basically, e.g.

`cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).

    * Gidon Gershinsky did not find this suggestion

compelling

enough

to

override his security concerns.
  * Low-level encryption done as third party Python package,

either

private

or open source. This is annoying technically, plausibly

would

require

maintaining a fork.
Any other ideas? Thoughts on these options?

—Itamar

Re: Fwd: Exposing low-level Parquet encryption to Python user (or, maybe not)

Reply via email to