[jira] [Commented] (PARQUET-1232) Document the modular encryption in parquet-format

2018-10-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649776#comment-16649776
 ] 

ASF GitHub Bot commented on PARQUET-1232:
-

zivanfi closed pull request #110: PARQUET-1232: Encryption docs
URL: https://github.com/apache/parquet-format/pull/110
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/Encryption.md b/Encryption.md
new file mode 100644
index ..156d9ce7
--- /dev/null
+++ b/Encryption.md
@@ -0,0 +1,234 @@
+
+
+# Parquet Modular Encryption
+
+Parquet files, containing sensitive information, can be protected by the modular
+encryption mechanism, that encrypts and authenticates the file data and metadata -
+while allowing for a regular Parquet functionality (columnar projection,
+predicate pushdown, encoding and compression). The mechanism also enables column access
+control, via support for encryption of different columns with different keys.
+
+## Problem Statement
+The existing data protection solutions (such as flat encryption of files, in-storage
+encryption, or a use of an encrypting storage client) can be applied to Parquet files,
+but have various security or performance issues. An encryption mechanism, integrated in
+the Parquet format, allows for an optimal combination of data security, processing
+speed and access control granularity.
+
+
+## Goals
+1. Protect Parquet data and metadata by encryption, while enabling selective reads
+(columnar projection, predicate push-down).
+2. Implement "client-side" encryption/decryption (storage client). The storage server
+must not see plaintext data, metadata or encryption keys.
+3. Leverage authenticated encryption that allows clients to check integrity of the
+retrieved data - making sure the file (or file parts) have not been replaced with a
+wrong version, or tampered with otherwise.
+4. Support column access control - by enabling different encryption keys for different
+columns, and for the footer.
+5. Allow for partial encryption - encrypt only column(s) with sensitive data.
+6. Work with all compression and encoding mechanisms supported in Parquet.
+7. Support multiple encryption algorithms, to account for different security and
+performance requirements.
+8. Enable two modes for metadata protection:
+   * full protection of file metadata
+   * partial protection of file metadata, that allows legacy readers to access unencrypted
+     columns in an encrypted file.
+9. Minimize overhead of encryption: in terms of size of encrypted files, and throughput
+of write/read operations.
+
+
+## Technical Approach
+
+Each Parquet module (footer, page headers, pages, column indexes, column metadata) is
+encrypted separately. Then it is possible to fetch and decrypt the footer, find the
+offset of a required page, fetch it and decrypt the data. In this document, the term
+“footer” always refers to the regular Parquet footer - the `FileMetaData` structure, and
+its nested fields (row groups / column chunks).
+
+The results of compression of column pages are encrypted, before being written to the
+output stream. A new Thrift structure, with a column crypto metadata, is added to
+column chunks of the encrypted columns. This metadata provides information about the
+column encryption keys.
+
+The results of Thrift serialization of metadata structures are encrypted, before being
+written to the output stream.
+
+## Encryption algorithms
+
+Parquet encryption algorithms are based on the standard AES ciphers for symmetric
+encryption. AES is supported in Intel and other CPUs with hardware acceleration of
+crypto operations (“AES-NI”) - that can be leveraged by e.g. Java programs
+(automatically via HotSpot), or C++ programs (via EVP-* functions in OpenSSL).
+
+Initially, two algorithms are implemented, one based on a GCM mode of AES, and the other
+on a combination of GCM and CTR modes.
+
+AES-GCM is an authenticated encryption mode. Besides the data confidentiality (encryption), it
+supports two levels of integrity verification / authentication: of the data (default), and
+of the data combined with an optional AAD (“additional authenticated data”). The default
+authentication makes it possible to verify that the data has not been tampered with. An AAD is a free
+text to be signed, together with the data. The user can, for example, pass the file name
+with its version (or creation timestamp) as the AAD, to verify the file has not been
+replaced with an older version.
+
+Sometimes, a hardware acceleration of AES is unavailable (e.g. in Java 8). Then AES crypto
+operations are implemented in software, and can be somewhat slow, becoming a performance
+bottleneck in 
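
The AAD idea described in the quoted Encryption.md text (binding a file name and version to the data so that a rollback to an older file is detected) can be sketched roughly as follows. This is only an illustration and not the Parquet implementation: AES-GCM is not available in the Python standard library, so HMAC-SHA256 is swapped in as the authentication step, and nothing is encrypted here. The function names, the file name and the version strings are all hypothetical.

```python
import hashlib
import hmac
import os


def make_tag(key: bytes, data: bytes, aad: bytes = b"") -> bytes:
    """Return an authentication tag computed over the AAD and the data.

    HMAC-SHA256 is a stand-in for the GCM tag (assumption of this sketch).
    """
    # Length-prefix the AAD so the boundary between AAD and data is unambiguous.
    message = len(aad).to_bytes(8, "big") + aad + data
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_tag(key: bytes, data: bytes, tag: bytes, aad: bytes = b"") -> bool:
    """Recompute the tag with the AAD the reader expects and compare."""
    return hmac.compare_digest(make_tag(key, data, aad), tag)


key = os.urandom(32)

# Hypothetical AAD values: file name plus version.
aad_v1 = b"sales.parquet|version=1"
aad_v2 = b"sales.parquet|version=2"

page_v1 = b"old column page bytes"
page_v2 = b"new column page bytes"

tag_v1 = make_tag(key, page_v1, aad_v1)
tag_v2 = make_tag(key, page_v2, aad_v2)

# The reader expects version 2 and therefore verifies with aad_v2.
reader_accepts_current = verify_tag(key, page_v2, tag_v2, aad_v2)
# An attacker swaps in the (validly tagged) version 1 file: rejected.
reader_accepts_rollback = verify_tag(key, page_v1, tag_v1, aad_v2)
# Modified data with the original tag: rejected.
reader_accepts_tampered = verify_tag(key, page_v2 + b"!", tag_v2, aad_v2)
```

In this sketch the old file still carries a valid tag for its own AAD, so the data-only check would pass. It is the mismatch between the AAD the reader supplies and the AAD used at write time that exposes the replacement.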

[jira] [Commented] (PARQUET-1232) Document the modular encryption in parquet-format

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631346#comment-16631346
 ] 

ASF GitHub Bot commented on PARQUET-1232:
-

ggershinsky closed pull request #101: PARQUET-1232: Encryption docs
URL: https://github.com/apache/parquet-format/pull/101
 
 
   

diff --git a/Encryption.md b/Encryption.md
new file mode 100644
index ..46f60676
--- /dev/null
+++ b/Encryption.md
@@ -0,0 +1,221 @@
+
+
+# Parquet Modular Encryption
+
+Parquet files, containing sensitive information, can be protected by the modular
+encryption mechanism, that encrypts and authenticates the file data and metadata -
+while allowing for a regular Parquet functionality (columnar projection,
+predicate pushdown, encoding and compression). The mechanism also enables column access
+control, via support for encryption of different columns with different keys.
+
+## Problem Statement
+The existing data protection solutions (such as flat encryption of files, in-storage
+encryption, or a use of an encrypting storage client) can be applied to Parquet files,
+but have various security or performance issues. An encryption mechanism, integrated in
+the Parquet format, allows for an optimal combination of data security, processing
+speed and access control granularity.
+
+
+## Goals
+1. Protect Parquet data and metadata by encryption, while enabling selective reads
+(columnar projection, predicate push-down).
+2. Implement "client-side" encryption/decryption (storage client). The storage server
+must not see plaintext data, metadata or encryption keys.
+3. Leverage authenticated encryption that allows clients to check integrity of the
+retrieved data - making sure the file (or file parts) have not been replaced with a
+wrong version, or tampered with otherwise.
+4. Support column access control - by enabling different encryption keys for different
+columns, and for the footer.
+5. Allow for partial encryption - encrypt only column(s) with sensitive data.
+6. Work with all compression and encoding mechanisms supported in Parquet.
+7. Support multiple encryption algorithms, to account for different security and
+performance requirements.
+8. Enable two modes for metadata protection:
+- full protection of file metadata
+- partial protection of file metadata, that allows old readers to access unencrypted
+  columns in an encrypted file.
+
+
+## Technical Approach
+
+Each Parquet module (footer, page headers, pages, column indexes, column metadata) is
+encrypted separately. Then it is possible to fetch and decrypt the footer, find the
+offset of a required page, fetch it and decrypt the data. In this document, the term
+“footer” always refers to the regular Parquet footer - the `FileMetaData` structure, and
+its nested fields (row groups / column chunks).
+
+The results of compression of column pages are encrypted, before being written to the
+output stream. A new Thrift structure, with a column crypto metadata, is added to
+column chunks of the encrypted columns. This metadata provides information about the
+column encryption keys.
+
+The results of Thrift serialization of metadata structures are encrypted, before being
+written to the output stream.
+
+## Encryption algorithms
+
+Parquet encryption algorithms are based on the standard AES ciphers for symmetric
+encryption. AES is supported in Intel and other CPUs with hardware acceleration of
+crypto operations (“AES-NI”) - that can be leveraged by e.g. Java programs
+(automatically via HotSpot), or C++ programs (via EVP-* functions in OpenSSL).
+
+Initially, two algorithms are implemented, one based on a GCM mode of AES, and the other
+on a combination of GCM and CTR modes.
+
+AES-GCM is an authenticated encryption mode. Besides the data confidentiality (encryption), it
+supports two levels of integrity verification / authentication: of the data (default), and
+of the data combined with an optional AAD (“additional authenticated data”). The default
+authentication makes it possible to verify that the data has not been tampered with. An AAD is a free
+text to be signed, together with the data. The user can, for example, pass the file name
+with its version (or creation timestamp) as the AAD, to verify the file has not been
+replaced with an older version.
+
+Sometimes, a hardware acceleration of AES is unavailable (e.g. in Java 8). Then AES crypto
+operations are implemented in software, and can be somewhat slow, becoming a performance
+bottleneck in certain workloads. AES-CTR is a regular (not authenticated) cipher.
+It is faster than AES-GCM, since it doesn’t perform 

[jira] [Commented] (PARQUET-1232) Document the modular encryption in parquet-format

2018-08-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574432#comment-16574432
 ] 

ASF GitHub Bot commented on PARQUET-1232:
-

ggershinsky opened a new pull request #101: PARQUET-1232: Encryption docs
URL: https://github.com/apache/parquet-format/pull/101
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Document the modular encryption in parquet-format
> -
>
> Key: PARQUET-1232
> URL: https://issues.apache.org/jira/browse/PARQUET-1232
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>
> Create Encryption.md from the design googledoc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1232) Document the modular encryption in parquet-format

2018-08-02 Thread Gidon Gershinsky (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566910#comment-16566910
 ] 

Gidon Gershinsky commented on PARQUET-1232:
---

Will try to send a PR with Encryption.md (and changes in Readme.md) by the end of next week.
