[ 
https://issues.apache.org/jira/browse/PARQUET-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1457:
----------------------------------
    Summary: [C++] Data set integrity tool  (was: Data set integrity tool)

> [C++] Data set integrity tool
> -----------------------------
>
>                 Key: PARQUET-1457
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1457
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp, parquet-mr
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Major
>
> Parquet encryption protects the integrity of individual files. However, data
> sets (such as tables) are often written as a collection of files, for example
> "/path/to/dataset"/part0.parquet.encrypted
> ..
> "/path/to/dataset"/partN.parquet.encrypted
>  
> On untrusted storage, the removal of one or more files goes unnoticed.
> Replacing the contents of one file with those of another also goes unnoticed,
> unless the user has provided a unique AAD prefix for each file.
>  
> The data set integrity tool solves these problems. Although it does not
> necessarily belong in Parquet itself (which is focused on individual
> files (?)), it will help higher-level frameworks that use Parquet to
> cryptographically protect the integrity of data sets composed of multiple
> files. Use of the tool is not obligatory, since frameworks can verify table
> (file collection) integrity by other means.
>  
> The tool works by creating a small file, which can be stored as, say,
> "/path/to/dataset"/.dataset.signature
>  
> and which contains the data set's unique name (URI) and the number of files.
> It can also contain an explicit list of file names (with or without the full
> path). The file contents are either encrypted with AES-GCM (authenticated,
> encrypted) or hashed and signed (authenticated, plaintext).
>  
> On the writer side, the tool creates an AAD prefix for every data file and
> creates the signature file itself. The inputs are the data set URI, the
> number of files N, and the encryption/signature key, plus (optionally) the
> list of file names (with or without the full path).
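
A minimal Python sketch of the writer-side flow described above. Everything concrete in it is an assumption made for illustration and is not part of the proposal: the function name `write_signature`, the JSON manifest layout, the random `dataset_id`, the `<uri>|<dataset_id>|<index>` AAD prefix format, and the use of HMAC-SHA256 as a stand-in for the "hashed and signed" (authenticated, plaintext) mode. The AES-GCM mode is not shown because it needs a third-party library.

```python
import hashlib
import hmac
import json
import os


def write_signature(dataset_uri, num_files, key, file_names=None):
    """Return (signature_file_bytes, aad_prefixes) for a data set of num_files files.

    Hypothetical helper: the layout and names here are illustrative only.
    """
    if file_names is not None and len(file_names) != num_files:
        raise ValueError("file_names must have exactly num_files entries")

    # Assumption: a random per-data-set value keeps AAD prefixes unique even
    # if the same URI is written again later.
    dataset_id = os.urandom(16).hex()

    # Assumption: the AAD prefix of file i is "<uri>|<dataset_id>|<i>".
    aad_prefixes = [f"{dataset_uri}|{dataset_id}|{i}" for i in range(num_files)]

    manifest = {
        "dataset_uri": dataset_uri,
        "dataset_id": dataset_id,
        "num_files": num_files,
        "file_names": file_names,  # optional explicit list, may be None
    }
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")

    # HMAC-SHA256 over the plaintext manifest (authenticated, not encrypted).
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")

    # Assumed file layout: manifest JSON, a newline, then the hex tag.
    return payload + b"\n" + tag, aad_prefixes
```

A framework would then pass `aad_prefixes[i]` as the AAD prefix when encrypting the i-th data file, and store the returned bytes next to the data files, for example as `.dataset.signature`.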
>  
> On the reader side, the tool parses and verifies the signature file and
> provides the framework with the verified data set name, the number of files
> that must be accounted for, and the AAD prefix for each file, plus
> (optionally) the list of file names (with or without the full path). The
> inputs are the expected data set URI and the encryption/signature key.
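
A minimal Python sketch of the reader-side check described above, under assumptions that are not part of the proposal: the signature file is a JSON manifest (with hypothetical fields `dataset_uri`, `dataset_id`, `num_files`, `file_names`), followed by a newline and a hex HMAC-SHA256 tag, and the AAD prefix of file i is `<uri>|<dataset_id>|<i>`. HMAC stands in for the "hashed and signed" mode; the function names are illustrative.

```python
import hashlib
import hmac
import json


def verify_signature(signature_bytes, expected_uri, key):
    """Verify a signature file; return (uri, num_files, aad_prefixes, file_names).

    Hypothetical helper: raises ValueError if the file was tampered with or
    belongs to a different data set.
    """
    # Assumed layout: manifest JSON, a newline, then the hex HMAC-SHA256 tag.
    payload, _, tag = signature_bytes.rpartition(b"\n")
    expected_tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(tag, expected_tag):
        raise ValueError("signature file failed authentication")

    manifest = json.loads(payload)
    if manifest["dataset_uri"] != expected_uri:
        raise ValueError("signature file belongs to a different data set")

    n = manifest["num_files"]
    # Assumption: prefixes are re-derived from the verified manifest fields.
    aad_prefixes = [
        f"{manifest['dataset_uri']}|{manifest['dataset_id']}|{i}" for i in range(n)
    ]
    return manifest["dataset_uri"], n, aad_prefixes, manifest.get("file_names")


def missing_files(verified_names, listed_names):
    """Names vouched for by the signature file but absent from the storage listing."""
    return sorted(set(verified_names) - set(listed_names))
```

With the verified count and names in hand, a framework can detect a removed file by comparing against the storage listing, and detect a swapped file because decryption with the wrong AAD prefix fails.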



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
