[
https://issues.apache.org/jira/browse/PARQUET-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684275#comment-16684275
]
Ryan Blue commented on PARQUET-1457:
------------------------------------
[~gershinsky], this sounds like a reasonable extension to a table format and
not really something that I think Parquet should be doing.
What do you think about coming up with a proposal for snapshot integrity for
[Iceberg|https://github.com/Netflix/iceberg]?
> Data set integrity tool
> -----------------------
>
> Key: PARQUET-1457
> URL: https://issues.apache.org/jira/browse/PARQUET-1457
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-mr
> Reporter: Gidon Gershinsky
> Assignee: Gidon Gershinsky
> Priority: Major
>
> Parquet encryption protects integrity of individual files. However, data sets
> (such as tables) are often written as a collection of files, say
> "/path/to/dataset"/part0.parquet.encrypted
> ..
> "/path/to/dataset"/partN.parquet.encrypted
>
> In untrusted storage, removal of one or more files will go unnoticed.
> Replacement of one file's contents with another's will also go unnoticed,
> unless the user has provided a unique AAD prefix for each file.
>
> The data set integrity tool solves these problems. Although it doesn't
> necessarily belong in Parquet functionality (that is focused on individual
> files (?)), it will help higher-level frameworks that use Parquet to
> cryptographically protect the integrity of data sets composed of multiple
> files. The use of this tool is not obligatory, as frameworks can use other
> means to verify table (file collection) integrity.
>
> The tool works by creating a small file that can be stored as, say,
> "/path/to/dataset"/.dataset.signature
>
> that contains the dataset's unique name (URI) and the number of files (N).
> The file's contents are either encrypted with AES-GCM (authenticated,
> encrypted) - or hashed and signed (authenticated, plaintext). A private key
> is issued for each dataset.
>
> On the writer side, the tool creates AAD prefixes for every data file, and
> creates the signature file itself. The input is the dataset URI, N and the
> encryption/signature key.
>
> On the reader side, the tool parses and verifies the signature file, and
> provides the framework with the verified dataset name, number of files that
> must be accounted for, and the AAD prefix for each file. The input is the
> expected dataset URI and the encryption/signature key.
>
>
>
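The writer and reader flow described in the issue could look roughly like the sketch below. It is only an illustration under several assumptions that are not part of the proposal: it covers the "authenticated, plaintext" variant only, it swaps in HMAC-SHA256 with a shared per-dataset key in place of a hash-and-sign scheme, and the JSON layout, the function names (write_signature, read_signature, aad_prefix) and the AAD prefix format are all hypothetical.

```python
import hashlib
import hmac
import json


def aad_prefix(dataset_uri: str, index: int) -> str:
    # Hypothetical prefix format: dataset URI plus the file's index.
    return f"{dataset_uri}#part{index}"


def write_signature(dataset_uri: str, num_files: int, key: bytes) -> bytes:
    """Writer side: return the contents of the signature file."""
    # The payload stays readable; the tag authenticates it.
    payload = json.dumps(
        {"uri": dataset_uri, "num_files": num_files}, sort_keys=True
    )
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return json.dumps({"payload": payload, "tag": tag}).encode("utf-8")


def read_signature(blob: bytes, expected_uri: str, key: bytes):
    """Reader side: verify the file, return (uri, num_files, aad_prefixes)."""
    envelope = json.loads(blob)
    payload = envelope["payload"]
    expected_tag = hmac.new(
        key, payload.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison of the recomputed and stored tags.
    if not hmac.compare_digest(
        expected_tag.encode("utf-8"), str(envelope["tag"]).encode("utf-8")
    ):
        raise ValueError("signature file failed authentication")
    body = json.loads(payload)
    if body["uri"] != expected_uri:
        raise ValueError("signature file belongs to a different dataset")
    n = body["num_files"]
    return body["uri"], n, [aad_prefix(body["uri"], i) for i in range(n)]
```

A framework would then compare the verified file count against the files it actually finds, and pass each returned prefix to the Parquet reader for the corresponding file, so that a removed or swapped file is detected.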
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)