[
https://issues.apache.org/jira/browse/PARQUET-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gidon Gershinsky updated PARQUET-1457:
--------------------------------------
Description:
Parquet encryption protects the integrity of individual files. However, data sets
(such as tables) are often written as a collection of files, say
"/path/to/dataset"/part0.parquet.encrypted
..
"/path/to/dataset"/partN.parquet.encrypted
In untrusted storage, the removal of one or more files will go unnoticed.
Replacing the contents of one file with another will also go unnoticed, unless
the user has provided unique AAD prefixes for each file.
The data set integrity tool solves these problems. While it doesn't necessarily
belong in Parquet itself (which focuses on individual files), it will assist
higher-level frameworks that use Parquet in cryptographically protecting the
integrity of data sets comprised of multiple files.
Use of this tool is not obligatory; frameworks can use other means to verify
table (file collection) integrity.
The tool works by creating a small file that can be stored as, say,
"/path/to/dataset"/.dataset.signature
It contains the dataset's unique name (URI) and the number of files (N). It can
also contain an explicit list of file names (with or without full paths). The
file contents are either encrypted with AES-GCM (authenticated and encrypted) or
hashed and signed (authenticated, plaintext).
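A minimal sketch of the hashed-and-signed (authenticated, plaintext) variant, assuming HMAC-SHA256 as the signature primitive and a simple JSON payload layout; both choices are illustrative, not part of this proposal:

```python
import hashlib
import hmac
import json

def make_signature_file(dataset_uri, num_files, key, file_names=None):
    """Build the contents of .dataset.signature: the dataset URI, the
    file count N, an optional file-name list, and an HMAC over all of it."""
    payload = json.dumps({
        "uri": dataset_uri,
        "n": num_files,
        "files": file_names,  # may be None when the list is omitted
    }, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload + b"\n" + tag.encode("ascii")

def verify_signature_file(contents, key):
    """Recompute the HMAC and return the parsed payload, or raise on tampering."""
    payload, _, tag = contents.rpartition(b"\n")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag.decode("ascii")):
        raise ValueError("signature file failed verification")
    return json.loads(payload)
```

Any tampering with the payload (e.g. lowering N to hide a deleted file) changes the HMAC and fails verification.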
On the writer side, the tool creates AAD prefixes for every data file, and
creates the signature file itself. The input is the dataset URI, N, and the
encryption/signature key, plus (optionally) the list of file names (with or
without full paths).
On the reader side, the tool parses and verifies the signature file and
provides the framework with the verified dataset name, the number of files that
must be accounted for, and the AAD prefix for each file, plus (optionally) the
list of file names (with or without full paths). The input is the expected
dataset URI and the encryption/signature key.
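The writer/reader exchange of AAD prefixes can be sketched as follows; the derivation scheme (URI plus file ordinal) is an illustrative assumption, as any deterministic, per-file-unique scheme would work:

```python
def aad_prefix_for(dataset_uri, file_index):
    """Derive a unique AAD prefix for one data file from the dataset URI
    and the file's ordinal; the exact format here is illustrative."""
    return f"{dataset_uri}/part{file_index}".encode("utf-8")

def reader_expectations(verified_uri, verified_n):
    """From the verified signature-file fields, tell the framework how many
    files must be accounted for and which AAD prefix each must decrypt under."""
    return {i: aad_prefix_for(verified_uri, i) for i in range(verified_n)}
```

Because each file must decrypt under its own expected prefix, swapping the contents of two files (or substituting a file from another dataset) fails GCM authentication on read.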
was:
Parquet encryption protects the integrity of individual files. However, data sets
(such as tables) are often written as a collection of files, say
"/path/to/dataset"/part0.parquet.encrypted
..
"/path/to/dataset"/partN.parquet.encrypted
In untrusted storage, the removal of one or more files will go unnoticed.
Replacing the contents of one file with another will also go unnoticed, unless
the user has provided unique AAD prefixes for each file.
The data set integrity tool solves these problems. While it doesn't necessarily
belong in Parquet itself (which focuses on individual files), it will assist
higher-level frameworks that use Parquet in cryptographically protecting the
integrity of data sets comprised of multiple files.
Use of this tool is not obligatory; frameworks can use other means to verify
table (file collection) integrity.
The tool works by creating a small file that can be stored as, say,
"/path/to/dataset"/.dataset.signature
It contains the dataset's unique name (URI) and the number of files (N). The
file contents are either encrypted with AES-GCM (authenticated and encrypted) or
hashed and signed (authenticated, plaintext). A private key is issued for each
dataset.
On the writer side, the tool creates AAD prefixes for every data file, and
creates the signature file itself. The input is the dataset URI, N, and the
encryption/signature key.
On the reader side, the tool parses and verifies the signature file and
provides the framework with the verified dataset name, the number of files that
must be accounted for, and the AAD prefix for each file. The input is the
expected dataset URI and the encryption/signature key.
> Data set integrity tool
> -----------------------
>
> Key: PARQUET-1457
> URL: https://issues.apache.org/jira/browse/PARQUET-1457
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-mr
> Reporter: Gidon Gershinsky
> Assignee: Gidon Gershinsky
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)