[jira] [Updated] (PARQUET-1301) [C++] Crypto package in parquet-cpp

2018-08-04 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1301:
---
Fix Version/s: 1.5.0

> [C++] Crypto package in parquet-cpp
> ---
>
> Key: PARQUET-1301
> URL: https://issues.apache.org/jira/browse/PARQUET-1301
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.0
>
>
> The C++ implementation of basic AES-GCM encryption and decryption



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1301) [C++] Crypto package in parquet-cpp

2018-08-04 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1301:

Labels: pull-request-available  (was: )

> [C++] Crypto package in parquet-cpp
> ---
>
> Key: PARQUET-1301
> URL: https://issues.apache.org/jira/browse/PARQUET-1301
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>
> The C++ implementation of basic AES-GCM encryption and decryption





[jira] [Commented] (PARQUET-1301) [C++] Crypto package in parquet-cpp

2018-08-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569313#comment-16569313
 ] 

ASF GitHub Bot commented on PARQUET-1301:
-

majetideepak closed pull request #464: PARQUET-1301: [C++] Crypto package in 
parquet-cpp
URL: https://github.com/apache/parquet-cpp/pull/464
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/parquet/types.h b/src/parquet/types.h
index 0f4cfc21..aec99656 100644
--- a/src/parquet/types.h
+++ b/src/parquet/types.h
@@ -113,6 +113,14 @@ struct Compression {
   enum type { UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD };
 };
 
+struct Encryption {
+  enum type {
+    AES_GCM_V1 = 0,
+    AES_GCM_CTR_V1 = 1
+  };
+};
+
+
 // parquet::PageType
 struct PageType {
   enum type { DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 };
diff --git a/src/parquet/util/crypto.cc b/src/parquet/util/crypto.cc
new file mode 100644
index 0000000..59383d18
--- /dev/null
+++ b/src/parquet/util/crypto.cc
@@ -0,0 +1,369 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/util/crypto.h"
+#include <openssl/aes.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <string.h>
+#include <algorithm>
+#include <memory>
+#include <sstream>
+#include <string>
+#include "parquet/exception.h"
+
+using parquet::ParquetException;
+
+namespace parquet_encryption {
+
+constexpr int aesGcm = 0;
+constexpr int aesCtr = 1;
+constexpr int encryptType = 0;
+constexpr int decryptType = 1;
+constexpr int gcmTagLen = 16;
+constexpr int gcmIvLen = 12;
+constexpr int ctrIvLen = 16;
+constexpr int rndMaxBytes = 32;
+
+#define ENCRYPT_INIT(CTX, ALG)                                        \
+  if (1 != EVP_EncryptInit_ex(CTX, ALG, nullptr, nullptr, nullptr)) { \
+    throw ParquetException("Couldn't init " #ALG " encryption");      \
+  }
+
+#define DECRYPT_INIT(CTX, ALG)                                        \
+  if (1 != EVP_DecryptInit_ex(CTX, ALG, nullptr, nullptr, nullptr)) { \
+    throw ParquetException("Couldn't init " #ALG " decryption");      \
+  }
+
+class EvpCipher {
+ public:
+  explicit EvpCipher(int cipher, int key_len, int type) {
+    ctx_ = nullptr;
+
+    if (aesGcm != cipher && aesCtr != cipher) {
+      std::stringstream ss;
+      ss << "Wrong cipher: " << cipher;
+      throw ParquetException(ss.str());
+    }
+
+    if (16 != key_len && 24 != key_len && 32 != key_len) {
+      std::stringstream ss;
+      ss << "Wrong key length: " << key_len;
+      throw ParquetException(ss.str());
+    }
+
+    if (encryptType != type && decryptType != type) {
+      std::stringstream ss;
+      ss << "Wrong cipher type: " << type;
+      throw ParquetException(ss.str());
+    }
+
+    ctx_ = EVP_CIPHER_CTX_new();
+    if (nullptr == ctx_) {
+      throw ParquetException("Couldn't init cipher context");
+    }
+
+    if (aesGcm == cipher) {
+      // Init AES-GCM with specified key length
+      if (16 == key_len) {
+        if (encryptType == type) {
+          ENCRYPT_INIT(ctx_, EVP_aes_128_gcm());
+        } else {
+          DECRYPT_INIT(ctx_, EVP_aes_128_gcm());
+        }
+      } else if (24 == key_len) {
+        if (encryptType == type) {
+          ENCRYPT_INIT(ctx_, EVP_aes_192_gcm());
+        } else {
+          DECRYPT_INIT(ctx_, EVP_aes_192_gcm());
+        }
+      } else if (32 == key_len) {
+        if (encryptType == type) {
+          ENCRYPT_INIT(ctx_, EVP_aes_256_gcm());
+        } else {
+          DECRYPT_INIT(ctx_, EVP_aes_256_gcm());
+        }
+      }
+    } else {
+      // Init AES-CTR with specified key length
+      if (16 == key_len) {
+        if (encryptType == type) {
+          ENCRYPT_INIT(ctx_, EVP_aes_128_ctr());
+        } else {
+          DECRYPT_INIT(ctx_, EVP_aes_128_ctr());
+        }
+      } else if (24 == key_len) {
+        if (encryptType == type) {
+          ENCRYPT_INIT(ctx_, EVP_aes_192_ctr());
+        } else {
+          DECRYPT_INIT(ctx_, EVP_aes_192_ctr());
+        }
+      } else if (32 == key_len) {
+ 

[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-04 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569210#comment-16569210
 ] 

Wes McKinney commented on PARQUET-1370:
---

I have opened some issues related to buffering / concurrent IO in C++, e.g. 
https://issues.apache.org/jira/browse/ARROW-501

[~rgruener] In 0.10.0, the pyarrow file handles now implement RawIOBase.

I don't think it would be too difficult to add a buffered reader to the Parquet 
hot path with a configurable buffer size. We already have a 
{{BufferedInputStream}}, which may help.

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp calls for a filesystem scan with every single data page 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The java implementation already does this and will read consecutive 
> column chunks (and the resulting pages) in a single scan see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure but it would certainly be valuable for workloads concerned 
> with optimal read performance.





[jira] [Updated] (PARQUET-1370) [C++] Read consecutive column chunks in a single scan

2018-08-04 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1370:
--
Summary: [C++] Read consecutive column chunks in a single scan  (was: Read 
consecutive column chunks in a single scan)

> [C++] Read consecutive column chunks in a single scan
> -
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp calls for a filesystem scan with every single data page 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The java implementation already does this and will read consecutive 
> column chunks (and the resulting pages) in a single scan see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure but it would certainly be valuable for workloads concerned 
> with optimal read performance.


