This is an automated email from the ASF dual-hosted git repository.

zivanfi pushed a commit to branch encryption
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/encryption by this push:
     new 028b12a  PARQUET-1618: Update encryption spec for bloom filter 
encryption (#141)
028b12a is described below

commit 028b12a83ee434a7cd3c443b42d35c328ea8c708
Author: ggershinsky <[email protected]>
AuthorDate: Mon Jul 15 10:08:59 2019 +0300

    PARQUET-1618: Update encryption spec for bloom filter encryption (#141)
---
 Encryption.md | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/Encryption.md b/Encryption.md
index 63dfadf..a9c54c0 100644
--- a/Encryption.md
+++ b/Encryption.md
@@ -53,7 +53,8 @@ of write/read operations.
 
 ## 3 Technical Approach
 Parquet files are comprised of separately serialized components: pages, page 
headers, column 
-indexes, offset indexes, a footer. Parquet encryption mechanism denotes them 
as “modules” 
+indexes, offset indexes, bloom filter headers and bitsets, the footer. Parquet 
encryption 
+mechanism denotes them as “modules” 
 and encrypts each module separately – making it possible to fetch and decrypt 
the footer, 
 find the offset of required pages, fetch the pages and decrypt the data. In 
this document, 
 the term “footer” always refers to the regular Parquet footer - the 
`FileMetaData` structure, 
@@ -78,15 +79,12 @@ in order to verify its integrity. New footer fields keep an
 information about the file encryption algorithm and the footer signing key.
 
 For encrypted columns, the following modules are always encrypted, with the 
same column key: 
-pages and  page headers (both dictionary and data), column indexes, offset 
indexes.  If the 
+pages and  page headers (both dictionary and data), column indexes, offset 
indexes, bloom filter 
+headers and bitsets.  If the 
 column key is different from the footer encryption key, the column metadata is 
serialized 
 separately and encrypted with the column key. In this case, the column 
metadata is also 
 considered to be a module.  
 
-There are two module types: data modules (pages) and Thrift modules (all 
Thrift structures that 
-are serialized separately).
-
-
 ## 4 Encryption Algorithms and Keys
 Parquet encryption algorithms are based on the standard AES ciphers for 
symmetric encryption. 
 AES is supported in Intel and other CPUs with hardware acceleration of crypto 
operations 
@@ -142,8 +140,8 @@ tag used to verify the ciphertext and AAD integrity.
 
 
 #### 4.2.2 AES_GCM_CTR_V1
-In this Parquet algorithm, all Thrift modules are encrypted with the GCM 
cipher, as described 
-above, but the pages are encrypted by the CTR cipher without padding. This 
allows to encrypt/decrypt 
+In this Parquet algorithm, all modules except pages are encrypted with the GCM 
cipher, as described 
+above. The pages are encrypted by the CTR cipher without padding. This allows 
to encrypt/decrypt 
 the bulk of the data faster, while still verifying the metadata integrity and 
making 
 sure the file has not been replaced with a wrong version. However, tampering 
with the 
 page data might go unnoticed. The AES CTR cipher
@@ -247,6 +245,8 @@ The following module types are defined:
    * Dictionary PageHeader (5)
    * ColumnIndex (6)
    * OffsetIndex (7)
+   * BloomFilter Header (8)
+   * BloomFilter Bitset (9)
 
 
 |                      | Internal File ID | Module type | Row group ordinal | 
Column ordinal | Page ordinal|
@@ -259,14 +259,16 @@ The following module types are defined:
 | Dictionary PageHeader|       yes        |   yes (5)   |        yes        |  
    yes       |     no      |
 | ColumnIndex          |       yes        |   yes (6)   |        yes        |  
    yes       |     no      |
 | OffsetIndex          |       yes        |   yes (7)   |        yes        |  
    yes       |     no      |
+| BloomFilter Header   |       yes        |   yes (8)   |        yes        |  
    yes       |     no      |
+| BloomFilter Bitset   |       yes        |   yes (9)   |        yes        |  
    yes       |     no      |
 
 
 
 ## 5 File Format
 
 ### 5.1 Encrypted module serialization
-The Thrift modules are encrypted with the GCM cipher. In the AES_GCM_V1 
algorithm, 
-the column pages (data modules) are also encrypted with AES GCM. For each 
module, the GCM encryption 
+All modules, except column pages, are encrypted with the GCM cipher. In the 
AES_GCM_V1 algorithm, 
+the column pages are also encrypted with AES GCM. For each module, the GCM 
encryption 
 buffer is comprised of a nonce, ciphertext and tag, described in the 
Algorithms section. The length of 
 the encryption buffer (a 4-byte little endian) is written to the output 
stream, followed by the buffer itself.
 

Reply via email to