Gidon Gershinsky created PARQUET-1457:
-----------------------------------------

             Summary: Data set integrity tool
                 Key: PARQUET-1457
                 URL: https://issues.apache.org/jira/browse/PARQUET-1457
             Project: Parquet
          Issue Type: New Feature
          Components: parquet-cpp, parquet-mr
            Reporter: Gidon Gershinsky
            Assignee: Gidon Gershinsky


Parquet encryption protects the integrity of individual files. However, data sets 
(such as tables) are often written as a collection of files, for example

"/path/to/dataset"/part0.parquet.encrypted

..

"/path/to/dataset"/partN.parquet.encrypted

 

In untrusted storage, the removal of one or more files will go unnoticed. 
Replacing the contents of one file with those of another will also go unnoticed, 
unless the user has provided a unique AAD prefix for each file.

 

The data set integrity tool solves these problems. Although it doesn't necessarily 
belong in Parquet functionality (which is focused on individual files (?)), it 
will help higher-level frameworks that use Parquet to cryptographically 
protect the integrity of data sets composed of multiple files.

The use of this tool is not obligatory, as frameworks can use other means to 
verify table (file collection) integrity.

 

The tool works by creating a small file that can be stored as, say,

"/path/to/dataset"/.dataset.signature

 

that contains the unique name of the dataset (URI) and the number of files (N). 
The file contents are either encrypted with AES-GCM (authenticated, encrypted) or 
hashed and signed (authenticated, plaintext). A private key is issued for each 
dataset.
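
As a rough illustration of the plaintext, authenticated variant, the sketch below 
serializes the dataset URI and N and appends an authentication tag. It is only a 
sketch under assumptions: the JSON layout, the field names and the function name 
build_signature_file are hypothetical, and HMAC-SHA256 stands in for the 
"hashed and signed" step, which this issue does not specify.

```python
import hashlib
import hmac
import json


def build_signature_file(dataset_uri: str, num_files: int, key: bytes) -> bytes:
    """Serialize the dataset name and file count, then append an HMAC tag."""
    # Hypothetical payload layout: a single line of JSON with sorted keys.
    payload = json.dumps(
        {"dataset_uri": dataset_uri, "num_files": num_files},
        sort_keys=True,
    ).encode("utf-8")
    # Assumption: HMAC-SHA256 replaces the unspecified hash-and-sign scheme.
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Line 1 is the plaintext payload, line 2 is the hex-encoded tag.
    return payload + b"\n" + tag.encode("ascii") + b"\n"


signature = build_signature_file("/path/to/dataset", 3, b"\x00" * 32)
```

The payload stays readable by anyone, but it cannot be altered without the 
per-dataset key, which is the property the plaintext variant asks for.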

 

On the writer side, the tool creates an AAD prefix for every data file and 
creates the signature file itself. The inputs are the dataset URI, N, and the 
encryption/signature key.
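
One possible way to produce the per-file AAD prefixes is sketched below. The issue 
does not say how the prefixes are derived, so the scheme here (dataset URI, a "#" 
separator and the file index) is purely an assumption, as are the function name 
plan_aad_prefixes and the choice to number the files from 0 to N-1.

```python
from typing import Dict


def plan_aad_prefixes(dataset_uri: str, num_files: int) -> Dict[str, bytes]:
    """Map each data file name to a unique AAD prefix (hypothetical scheme)."""
    if num_files < 1:
        raise ValueError("a dataset needs at least one file")
    # Assumption: the prefix binds a file to its dataset and to its position,
    # so a file moved to another dataset or another slot gets a different AAD.
    return {
        f"part{i}.parquet.encrypted": f"{dataset_uri}#{i}".encode("utf-8")
        for i in range(num_files)
    }


prefixes = plan_aad_prefixes("/path/to/dataset", 3)
```

A framework would pass each prefix to the Parquet writer when encrypting the 
matching file. Because the prefixes are deterministic, a reader that has verified 
the URI and N can recompute them without storing them anywhere.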

 

On the reader side, the tool parses and verifies the signature file, and 
provides the framework with the verified dataset name, the number of files that 
must be accounted for, and the AAD prefix for each file. The inputs are the 
expected dataset URI and the encryption/signature key.
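
The reader-side flow could look roughly like the sketch below. It is not the 
proposed implementation: it assumes a hypothetical signature file made of one line 
of JSON (fields dataset_uri and num_files) followed by one line holding a 
hex-encoded HMAC-SHA256 tag, and it assumes each AAD prefix is the dataset URI 
followed by "#" and the file index. The names IntegrityError and 
verify_signature_file are illustrative only.

```python
import hashlib
import hmac
import json
from typing import List, Tuple


class IntegrityError(Exception):
    """Raised when the signature file cannot be trusted."""


def verify_signature_file(
    blob: bytes, expected_uri: str, key: bytes
) -> Tuple[str, int, List[bytes]]:
    """Return (dataset name, number of files, AAD prefixes) after verification."""
    try:
        payload, tag_hex = blob.rstrip(b"\n").split(b"\n")
    except ValueError:
        raise IntegrityError("malformed signature file")
    # Check the tag before trusting any field in the payload.
    expected_tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(tag_hex, expected_tag):
        raise IntegrityError("signature mismatch")
    fields = json.loads(payload)
    if fields["dataset_uri"] != expected_uri:
        raise IntegrityError("signature file belongs to another dataset")
    num_files = fields["num_files"]
    # Assumption: prefixes are recomputed from the verified URI and file index.
    prefixes = [f"{expected_uri}#{i}".encode("utf-8") for i in range(num_files)]
    return fields["dataset_uri"], num_files, prefixes


# Build a sample signature file in the assumed layout, then verify it.
key = b"\x01" * 32
payload = json.dumps(
    {"dataset_uri": "/path/to/dataset", "num_files": 2}, sort_keys=True
).encode("utf-8")
tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
blob = payload + b"\n" + tag + b"\n"
name, count, prefixes = verify_signature_file(blob, "/path/to/dataset", key)
```

The framework would then check that exactly count data files are present and open 
each one with its prefix, so that a removed or swapped file is detected.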

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
