Hi Itamar,

I implemented some Python wrappers for the low-level API and would be happy to 
collaborate on that. The reason I haven't pushed this forward yet is what Gidon 
mentioned: the API we expose to Python users needs to be finalized first, and it 
must include the key tools API for interoperability with Spark.

Perhaps it would be good to kick off a discussion on what the pyarrow API for PME 
should look like (in parallel with reviewing the arrow-cpp implementation of 
key tools, to ensure that wrapping it would be a reasonable effort).

One possible approach is to expose both the low-level API and key tools 
separately. A user would create and initialize a PropertiesDrivenCryptoFactory 
and use it to create the FileEncryptionProperties/FileDecryptionProperties to 
pass to the lower-level API. In pandas this would translate to something like:
```
factory = PropertiesDrivenCryptoFactory(...)
df.to_parquet(path, engine="pyarrow",
              encryption=factory.getFileEncryptionProperties(...))
df = pd.read_parquet(path, engine="pyarrow",
                     decryption=factory.getFileDecryptionProperties(...))
```
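For concreteness, initializing the factory might look like the sketch below. 
This is a hypothetical sketch only: the module path and the 
KmsConnectionConfig/EncryptionConfiguration names and parameters are assumptions 
modeled on the arrow-cpp key tools classes, not a finalized Python API.
```
# Hypothetical sketch; module path and class/parameter names are
# assumptions modeled on the arrow-cpp key tools classes.
from pyarrow.parquet.encryption import (
    PropertiesDrivenCryptoFactory, KmsConnectionConfig, EncryptionConfiguration)

# Connection details for the KMS that holds the master keys.
kms_config = KmsConnectionConfig(
    kms_instance_url="https://kms.example.com",  # assumed endpoint
    key_access_token="my_token")

# Which master key protects the footer, and which columns each
# column master key protects.
encryption_config = EncryptionConfiguration(
    footer_key="footer_key_id",
    column_keys={"column_key_id": ["col1", "col2"]})

factory = PropertiesDrivenCryptoFactory()
encryption_properties = factory.getFileEncryptionProperties(
    kms_config, encryption_config)
```
The resulting encryption_properties object would then be passed as the 
encryption argument in the pandas calls above.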
This should also work for reading datasets, since decryption uses a 
KeyRetriever, but I'm not sure what will need to be done once datasets 
support writing.
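To illustrate the dataset read case (again a hypothetical sketch: the 
decryption keyword on ParquetDataset is an assumption, as are the names carried 
over from the sketch above):
```
import pyarrow.parquet as pq

# Hypothetical: assumes ParquetDataset grows a `decryption` keyword.
# The KeyRetriever inside the decryption properties resolves the key
# for each file, so one properties object covers the whole dataset.
decryption_properties = factory.getFileDecryptionProperties(kms_config)
table = pq.ParquetDataset(path, decryption=decryption_properties).read()
```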

On 2020/09/03 14:11:51, "Itamar Turner-Trauring" <ita...@pythonspeed.com> 
wrote: 
> Hi,
> 
> I'm looking into implementing this, and it seems like there are two parts: 
> packaging, but also wrapping the APIs in Python. Is the latter item accurate? 
> If so, any examples of similar existing wrapped APIs, or should I just come 
> up with something on my own?
> 
> Context:
> https://github.com/apache/arrow/pull/4826
> https://issues.apache.org/jira/browse/ARROW-8040
> 
> Thanks,
> 
> —Itamar
