[ https://issues.apache.org/jira/browse/PARQUET-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934945#comment-16934945 ]
Gidon Gershinsky edited comment on PARQUET-1659 at 9/21/19 6:10 AM:
--------------------------------------------------------------------

A number of comments:
# The idea behind AES_GCM_CTR is that in Parquet files the metadata size is negligible compared to the data size. E.g., the default page size is around a megabyte, and the page header is, say, a hundred bytes, meaning the metadata is around 0.01% of the data. When running with old Java (without AES-NI acceleration), GCM is indeed slower than CTR, but applying it to 0.01% of the file shouldn't be noticeable.
# Basic math shows that getting a 10% speedup means that in your files the metadata part is around a few percent instead of 0.01%, i.e. hundreds of times larger. It would be good to analyze the reasons (small pages? small files where the footer size becomes a considerable part of the file size? anything else?); this is easy to measure.
# Java 9 and later run AES in hardware, so it's possible to get full data protection (both encryption and integrity guarantees) via GCM without a noticeable change in throughput.
# Introduction of a third algorithm means additions to the spec and to the Thrift definition, which could require another round of PMC vote. We might want to proceed with the parquet-2.7.0 release as is, and in parallel investigate the reasons for the 10% speedup in your files (item 2). If we decide to add a pure CTR algo, it can be part of, say, parquet-2.7.1.

was (Author: gershinsky):
A number of comments
# The idea behind AES_GCM_CTR is that in Parquet files the metada size is negligible compared to the data size. E.g., the default page size is around megabyte, and the page header is say a hundred bytes- meaning the metadata is around 0.01% of data. When running with old Java (with AES-NI acceleration), GCM is indeed slower than CTR, but applying it on 0.01% of the file shouldn't be noticeable.
# Basic math shows that getting 10% speed up means that in your files the metadata part is around a few percents, instead of 0.01%. Would be good to analyze the reasons (small pages? or small files where the footer size becomes a considerable part of the file size? anything else?) - this is easy to measure.
# Java 9 and later run AES in hardware, so its possible to get full data protection (encryption and integrity guarantees) via GCM without noticeable change in throughput.
# Introduction of a third algorithm means additions in the spec and in the thrift and could require another round of PMC vote. We might want to proceed with the parquet-2.7.0 release as is, and in parallel investigate the reasons you get 10% speed up (item 2). If we decide to add a pure CTR algo, it can be a part of say parquet-2.7.1.

> Add AES-CTR to Parquet Encryption
> ----------------------------------
>
>                 Key: PARQUET-1659
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1659
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format, parquet-mr
>    Affects Versions: format-2.6.0
>            Reporter: Xinli Shang
>            Priority: Minor
>              Labels: pull-request-available
>
> AES-GCM-CTR performs GCM encryption on metadata and CTR encryption on data.
> AES-CTR would perform CTR encryption on both.
> During perf testing, we found AES-CTR can improve read/write performance by
> ~10% compared with AES-GCM-CTR.
>
> I checked with Gidon, and the initial assumption was that AES-GCM-CTR would
> have similar performance to AES-CTR. But with recent performance
> benchmarking, we found it is worthwhile to introduce AES-CTR, since many
> companies strive for Parquet performance improvements.
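
The arithmetic behind items 1 and 2 of the comment can be sketched as follows. This is only a rough model, not a measurement: the function names are hypothetical, only cipher cost is counted (CTR costs 1 unit per byte, GCM costs `gcm_slowdown` units per byte), and the 4x GCM-to-CTR slowdown used in the example is an illustrative assumption rather than a benchmark result.

```python
def metadata_to_data_ratio(page_bytes: int, header_bytes: int) -> float:
    """Ratio of page-header (metadata) bytes to page (data) bytes."""
    return header_bytes / page_bytes


def implied_metadata_fraction(speedup: float, gcm_slowdown: float) -> float:
    """Metadata share f of the file needed to explain an observed speedup.

    Model (an assumption, cipher cost only):
      cost of AES_GCM_CTR = (1 - f) * 1 + f * gcm_slowdown
      cost of pure CTR    = 1
      speedup             = 1 - cost_ctr / cost_gcm_ctr
    Solving for f gives f = speedup / ((1 - speedup) * (gcm_slowdown - 1)).
    """
    if not 0.0 <= speedup < 1.0:
        raise ValueError("speedup must be in [0, 1)")
    if gcm_slowdown <= 1.0:
        raise ValueError("gcm_slowdown must be > 1")
    return speedup / ((1.0 - speedup) * (gcm_slowdown - 1.0))


if __name__ == "__main__":
    # Item 1: ~1 MB page with a ~100 byte header.
    default_ratio = metadata_to_data_ratio(1_000_000, 100)
    print(f"default metadata/data ratio: {default_ratio:.4%}")

    # Item 2: a 10% speedup, assuming (hypothetically) GCM is 4x slower than CTR.
    implied = implied_metadata_fraction(0.10, 4.0)
    print(f"implied metadata fraction:   {implied:.2%}")
    print(f"ratio to the default:        {implied / default_ratio:.0f}x")
```

Because the model ignores I/O, compression, and decoding time, the real metadata share needed to explain a 10% end-to-end speedup would be larger still, which is consistent with the suggestion to check page and footer sizes in the benchmarked files.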