GPSnoopy commented on pull request #9631:
URL: https://github.com/apache/arrow/pull/9631#issuecomment-831553755


   Hi @ggershinsky,
   
   I'm one of the end-users pestering @itamarst. ;-)
   
   I don't claim to be an encryption expert, so the following feedback is 
purely from a user/developer perspective. 
   
   - We already have an in-house crypto library which handles the security 
choices, design and the integration with our KMS.
   - The integration of this library with Apache Arrow/Parquet (via 
ParquetSharp) is about 10-20 lines of code.
   - This crypto library generates the AES key, encrypts it using asymmetric 
keys (obtained via the KMS and selected by a company-internal, user-provided 
key identifier), adds some necessary header information and publishes the 
result to Parquet as the key identifier.
   - It also deals with user authentication and key permissions.
   - This means that the way we manage Parquet encryption inside the company is 
consistent with how the rest of the company handles encryption, and is approved 
by the various security teams.
   - Being compatible with other external tools and a de facto high-level 
Parquet encryption standard is nice, but ultimately the company cares about 
its own sensitive IP. So being compatible with the company ecosystem is a 
higher priority than being compatible with Spark (we will never share 
encrypted files with other companies, which is kind of the main point).
   - The low-level API is internally used by us in both C++ and C#. So why is 
Python different?
   - I'm not sure I understand or appreciate the reluctance to provide both the 
low-level and higher-level APIs. It's a really nice property of a library to 
expose various levels of abstraction, such that the user can integrate with 
the library at the required level. Having both APIs means that you provide the 
correct default behaviour and compatibility with the Spark ecosystem for your 
users, and also **provide the necessary flexibility for users with use-cases 
you have not anticipated or foreseen**.
   
   IMHO the last point should be carefully considered, as it's reflected and 
used in highly acclaimed libraries and APIs, such as the C++ STL, Boost, Zlib, 
Vulkan, etc. (personal bias in this choice of libraries, of course; 
interestingly DirectX12/Vulkan do prove a point though: developers want 
finer-grained access and more control in their APIs, not less).
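
   To make the key-identifier flow from the list above concrete, here is a 
minimal Python sketch of the packaging step only. It is not our in-house 
library and not the Arrow/Parquet API: the header layout, the function names 
and the `InMemoryKms` class are all hypothetical, and the in-memory "KMS" is a 
test stand-in that hands out opaque handles instead of doing real asymmetric 
wrapping.

```python
import base64
import json
import secrets
from typing import Callable, Dict, Tuple

# Hypothetical signatures for the KMS-backed wrap/unwrap operations.
WrapFn = Callable[[str, bytes], bytes]
UnwrapFn = Callable[[str, bytes], bytes]


def build_key_metadata(master_key_id: str, wrap_key: WrapFn) -> Tuple[bytes, bytes]:
    """Generate a fresh 128-bit AES data key and package its wrapped form.

    Returns (data_key, key_metadata). Only key_metadata would be stored in
    the file; the data key itself is handed to the encryption layer.
    """
    data_key = secrets.token_bytes(16)
    wrapped = wrap_key(master_key_id, data_key)
    header = {
        "v": 1,  # hypothetical format version
        "master_key_id": master_key_id,
        "wrapped_key": base64.b64encode(wrapped).decode("ascii"),
    }
    return data_key, json.dumps(header, sort_keys=True).encode("utf-8")


def resolve_key(key_metadata: bytes, unwrap_key: UnwrapFn) -> bytes:
    """Recover the data key from stored key metadata via the KMS."""
    header = json.loads(key_metadata.decode("utf-8"))
    if header.get("v") != 1:
        raise ValueError("unsupported key metadata version")
    wrapped = base64.b64decode(header["wrapped_key"])
    return unwrap_key(header["master_key_id"], wrapped)


class InMemoryKms:
    """Test stand-in only: returns opaque handles, performs NO encryption."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, bytes], bytes] = {}

    def wrap(self, master_key_id: str, data_key: bytes) -> bytes:
        handle = secrets.token_bytes(32)
        self._store[(master_key_id, handle)] = data_key
        return handle

    def unwrap(self, master_key_id: str, wrapped: bytes) -> bytes:
        try:
            return self._store[(master_key_id, wrapped)]
        except KeyError:
            raise PermissionError("unknown key or access denied") from None
```

   In a real setup the `wrap`/`unwrap` callables would be where user 
authentication and key permissions are enforced, which is why a low-level hook 
that accepts caller-supplied key metadata is enough for this kind of 
integration.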


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

