[ 
https://issues.apache.org/jira/browse/PARQUET-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684275#comment-16684275
 ] 

Ryan Blue commented on PARQUET-1457:
------------------------------------

[~gershinsky], this sounds like a reasonable extension to a table format and 
not really something that I think Parquet should be doing.

What do you think about coming up with a proposal for snapshot integrity for 
[Iceberg|https://github.com/Netflix/iceberg]?

> Data set integrity tool
> -----------------------
>
>                 Key: PARQUET-1457
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1457
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp, parquet-mr
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Major
>
> Parquet encryption protects the integrity of individual files. However, data sets 
> (such as tables) are often written as a collection of files, say
> "/path/to/dataset"/part0.parquet.encrypted
> ..
> "/path/to/dataset"/partN.parquet.encrypted
>  
> In untrusted storage, removal of one or more files will go unnoticed. 
> Replacement of one file's contents with another's will also go unnoticed, 
> unless the user has provided a unique AAD prefix for each file.
>  
> The data set integrity tool solves these problems. While it doesn't 
> necessarily belong in Parquet itself (which is focused on individual 
> files (?)), it will help higher-level frameworks that use Parquet to 
> cryptographically protect the integrity of data sets composed of multiple 
> files. The use of this tool is not obligatory, as frameworks can use other 
> means to verify table (file collection) integrity.
>  
> The tool works by creating a small file, which can be stored as, say,
> "/path/to/dataset"/.dataset.signature
>  
> that contains the dataset's unique name (URI) and the number of files (N). 
> The file's contents are either encrypted with AES-GCM (authenticated, 
> encrypted) or hashed and signed (authenticated, plaintext). A private key is 
> issued for each dataset.
>  
> On the writer side, the tool creates AAD prefixes for every data file, and 
> creates the signature file itself. The input is the dataset URI, N and the 
> encryption/signature key.
>  
> On the reader side, the tool parses and verifies the signature file, and 
> provides the framework with the verified dataset name, number of files that 
> must be accounted for, and the AAD prefix for each file. The input is the 
> expected dataset URI and the encryption/signature key.
>  
>  
>  
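
To make the writer/reader flow in the quoted description concrete, here is a 
rough Python sketch of the "authenticated, plaintext" variant. It is only an 
illustration under several assumptions, not the proposed tool: HMAC-SHA256 
with a shared key stands in for "hashed and signed", the JSON layout of the 
signature file is invented, and the AAD prefix format (dataset URI plus file 
index) is hypothetical. The names write_signature, read_signature and 
aad_prefix do not come from the issue.

```python
import hashlib
import hmac
import json


def aad_prefix(dataset_uri: str, index: int) -> bytes:
    # Hypothetical prefix layout: dataset URI plus the file's ordinal,
    # so every data file gets a distinct, reproducible prefix.
    return f"{dataset_uri}#part{index}".encode("utf-8")


def write_signature(dataset_uri: str, num_files: int, key: bytes):
    """Writer side: return (per-file AAD prefixes, signature file bytes)."""
    # The payload holds only what the description lists: URI and N.
    payload = json.dumps({"uri": dataset_uri, "num_files": num_files},
                         sort_keys=True)
    # Assumption: HMAC-SHA256 as a stand-in for "hashed and signed".
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    signature_file = json.dumps({"payload": payload, "tag": tag}).encode("utf-8")
    prefixes = [aad_prefix(dataset_uri, i) for i in range(num_files)]
    return prefixes, signature_file


def read_signature(signature_file: bytes, expected_uri: str, key: bytes):
    """Reader side: verify the file, return (uri, num_files, AAD prefixes)."""
    envelope = json.loads(signature_file)
    payload = envelope["payload"]
    expected_tag = hmac.new(key, payload.encode("utf-8"),
                            hashlib.sha256).hexdigest()
    # Constant-time comparison of the stored and recomputed tags.
    if not hmac.compare_digest(expected_tag, envelope["tag"]):
        raise ValueError("signature file failed authentication")
    body = json.loads(payload)
    if body["uri"] != expected_uri:
        raise ValueError("signature file belongs to a different dataset")
    count = body["num_files"]
    return body["uri"], count, [aad_prefix(body["uri"], i) for i in range(count)]
```

A framework using something like this would compare the verified file count 
with the files it actually finds, and pass each returned prefix to the 
Parquet reader when opening the corresponding file, so that a missing or 
swapped file is detected.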



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
