[ 
https://issues.apache.org/jira/browse/PARQUET-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934945#comment-16934945
 ] 

Gidon Gershinsky edited comment on PARQUET-1659 at 9/21/19 6:10 AM:
--------------------------------------------------------------------

A number of comments
 # The idea behind AES_GCM_CTR is that in Parquet files the metadata size is 
negligible compared to the data size. E.g., the default page size is around a 
megabyte, and the page header is, say, a hundred bytes - meaning the metadata 
is around 0.01% of the data. When running with old Java (without AES-NI 
acceleration), GCM is indeed slower than CTR, but applying it to 0.01% of the 
file shouldn't be noticeable.
 # Basic math shows that a 10% speedup means that in your files the metadata 
part is around a few percent, instead of 0.01% - hundreds of times larger. It 
would be good to analyze the reasons (small pages? small files where the 
footer size becomes a considerable part of the file size? anything else?) 
- this is easy to measure.
 # Java 9 and later run AES in hardware, so it's possible to get full data 
protection (both encryption and integrity guarantees) via GCM without a 
noticeable change in throughput.
 # Introducing a third algorithm means additions to the spec and to the 
thrift, which could require another round of PMC vote. We might want to proceed 
with the parquet-2.7.0 release as is, and in parallel investigate the reasons 
for the 10% speedup in your files (item 2). If we decide to add a pure CTR 
algo, it can be part of, say, parquet-2.7.1.
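
The arithmetic in items 1 and 2 can be sketched roughly as follows. This is a back-of-envelope model, not a measurement: it assumes cipher time dominates the read/write path, and gcmCostRatio (how many times more expensive GCM is per byte than CTR on a given JVM) is a hypothetical parameter, with 4.0 used purely as a placeholder. The class and method names are made up for illustration and are not Parquet APIs.

```java
// Back-of-envelope model only; names and the cost ratio are assumptions.
final class MetadataShare {

    /** Metadata size as a fraction of data size (e.g. page header vs. page). */
    static double metadataFraction(long headerBytes, long pageBytes) {
        return (double) headerBytes / pageBytes;
    }

    /**
     * Extra cipher time of AES_GCM_CTR relative to pure CTR, assuming GCM
     * costs gcmCostRatio times as much per byte as CTR and is applied only
     * to the metadata fraction of the bytes.
     */
    static double slowdown(double metadataFraction, double gcmCostRatio) {
        return metadataFraction * (gcmCostRatio - 1.0);
    }

    /** Metadata fraction that would be needed to explain a given slowdown. */
    static double impliedMetadataFraction(double slowdown, double gcmCostRatio) {
        return slowdown / (gcmCostRatio - 1.0);
    }

    public static void main(String[] args) {
        // ~100-byte header on a ~1 MB page, as in item 1.
        double defaultShare = metadataFraction(100, 1_000_000);
        // 4.0 is a placeholder ratio, not a measured value.
        double impliedShare = impliedMetadataFraction(0.10, 4.0);

        System.out.printf("default metadata share: %.4f%%%n", defaultShare * 100);
        System.out.printf("share implied by a 10%% slowdown: %.2f%%%n", impliedShare * 100);
        System.out.printf("implied / default: %.0fx%n", impliedShare / defaultShare);
    }
}
```

With these placeholder inputs the implied share lands in the low single-digit percent range, in line with the "few percent" estimate in item 2; a different cost ratio shifts the number, which is why measuring the actual metadata share of the files is the more reliable check.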


was (Author: gershinsky):
A number of comments
 # The idea behind AES_GCM_CTR is that in Parquet files the metadata size is 
negligible compared to the data size. E.g., the default page size is around a 
megabyte, and the page header is say a hundred bytes - meaning the metadata is 
around 0.01% of data. When running with old Java (with AES-NI acceleration), 
GCM is indeed slower than CTR, but applying it on 0.01% of the file shouldn't 
be noticeable.
 # Basic math shows that getting a 10% speedup means that in your files the 
metadata part is around a few percent, instead of 0.01%. Would be good to 
analyze the reasons (small pages? or small files where the footer size becomes 
a considerable part of  the file size? anything else?) - this is easy to 
measure.
 # Java 9 and later run AES in hardware, so it's possible to get full data 
protection (encryption and integrity guarantees) via GCM without noticeable 
change in throughput.
 # Introduction of a third algorithm means additions in the spec and in the 
thrift and could require another round of PMC vote. We might want to proceed 
with the parquet-2.7.0 release as is, and in parallel investigate the reasons 
you get 10% speed up (item 2). If we decide to add a pure CTR algo, it can be a 
part of say parquet-2.7.1.

> Add AES-CTR to Parquet Encryption 
> ----------------------------------
>
>                 Key: PARQUET-1659
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1659
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format, parquet-mr
>    Affects Versions: format-2.6.0
>            Reporter: Xinli Shang
>            Priority: Minor
>              Labels: pull-request-available
>
> AES-GCM-CTR performs GCM encryption on metadata and CTR encryption on data.
> AES-CTR would perform CTR encryption on both. 
> During perf testing, we found AES-CTR can improve read/write performance by 
> ~10% compared with AES-GCM-CTR.
>  
> I checked with Gidon and the initial assumption was that AES-GCM-CTR would 
> have similar performance to AES-CTR. But with recent performance 
> benchmarking, we found it is worthwhile to introduce AES-CTR, since many 
> companies strive for Parquet performance improvements. 
>  
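
To make the difference between the two modes concrete, here is a minimal sketch that uses only the standard javax.crypto API. It is not the parquet-mr implementation and does not follow the Parquet spec's nonce/IV layout; the class name, method names, and key handling are simplified assumptions for illustration. It encrypts a small header with AES-GCM (ciphertext plus a 16-byte authentication tag) and a page body with AES-CTR (ciphertext only, no integrity check).

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only: plain JCE calls, not the parquet-mr encryptor classes.
final class GcmCtrSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    static byte[] randomBytes(int n) {
        byte[] b = new byte[n];
        RANDOM.nextBytes(b);
        return b;
    }

    /** AES-GCM with a 128-bit tag: confidentiality plus integrity. */
    static byte[] gcm(int mode, byte[] key, byte[] nonce12, byte[] input) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(mode, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, nonce12));
            return c.doFinal(input);
        } catch (GeneralSecurityException e) {
            // Wrapped to keep the sketch short; a bad tag surfaces here on decrypt.
            throw new IllegalStateException(e);
        }
    }

    /** AES-CTR: confidentiality only, output length equals input length. */
    static byte[] ctr(int mode, byte[] key, byte[] iv16, byte[] input) {
        try {
            Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
            c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv16));
            return c.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = randomBytes(16);        // AES-128 key
        byte[] header = "page header".getBytes(StandardCharsets.UTF_8);
        byte[] page = randomBytes(1 << 20);  // stand-in for ~1 MB of page data

        byte[] nonce = randomBytes(12);      // must never repeat for the same key
        byte[] iv = randomBytes(16);
        byte[] encHeader = gcm(Cipher.ENCRYPT_MODE, key, nonce, header);
        byte[] encPage = ctr(Cipher.ENCRYPT_MODE, key, iv, page);

        System.out.println("GCM adds bytes: " + (encHeader.length - header.length));
        System.out.println("CTR adds bytes: " + (encPage.length - page.length));
        System.out.println("header round trip: "
            + Arrays.equals(header, gcm(Cipher.DECRYPT_MODE, key, nonce, encHeader)));
        System.out.println("page round trip: "
            + Arrays.equals(page, ctr(Cipher.DECRYPT_MODE, key, iv, encPage)));
    }
}
```

The trade-off under discussion shows up directly here: a modified GCM ciphertext is rejected on decryption, while a modified CTR ciphertext decrypts without error into altered plaintext. Timing the two methods on the same buffer and JVM would be one way to check how large the GCM/CTR gap is in a given environment.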



