[jira] [Commented] (PARQUET-1232) Document the modular encryption in parquet-format

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631346#comment-16631346
 ] 

ASF GitHub Bot commented on PARQUET-1232:
-

ggershinsky closed pull request #101: PARQUET-1232: Encryption docs
URL: https://github.com/apache/parquet-format/pull/101
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/Encryption.md b/Encryption.md
new file mode 100644
index ..46f60676
--- /dev/null
+++ b/Encryption.md
@@ -0,0 +1,221 @@
+
+
+# Parquet Modular Encryption
+
+Parquet files containing sensitive information can be protected by the modular
+encryption mechanism, which encrypts and authenticates the file data and
+metadata, while allowing regular Parquet functionality (columnar projection,
+predicate pushdown, encoding and compression). The mechanism also enables column
+access control, via support for encryption of different columns with different keys.
+
+## Problem Statement
+Existing data protection solutions (such as flat encryption of files, in-storage
+encryption, or use of an encrypting storage client) can be applied to Parquet files,
+but have various security or performance issues. An encryption mechanism integrated
+in the Parquet format allows for an optimal combination of data security, processing
+speed and access control granularity.
+
+
+## Goals
+1. Protect Parquet data and metadata by encryption, while enabling selective reads
+(columnar projection, predicate push-down).
+2. Implement "client-side" encryption/decryption (storage client). The storage server
+must not see plaintext data, metadata or encryption keys.
+3. Leverage authenticated encryption that allows clients to check the integrity of the
+retrieved data - making sure the file (or file parts) has not been replaced with a
+wrong version, or otherwise tampered with.
+4. Support column access control - by enabling different encryption keys for different
+columns, and for the footer.
+5. Allow for partial encryption - encrypt only column(s) with sensitive data.
+6. Work with all compression and encoding mechanisms supported in Parquet.
+7. Support multiple encryption algorithms, to account for different security and
+performance requirements.
+8. Enable two modes for metadata protection:
+- full protection of file metadata
+- partial protection of file metadata, which allows old readers to access unencrypted
+  columns in an encrypted file.
+
+
+## Technical Approach
+
+Each Parquet module (footer, page headers, pages, column indexes, column metadata) is
+encrypted separately. It is then possible to fetch and decrypt the footer, find the
+offset of a required page, fetch it and decrypt the data. In this document, the term
+“footer” always refers to the regular Parquet footer - the `FileMetaData` structure and
+its nested fields (row groups / column chunks).
+
+The results of compression of column pages are encrypted before being written to the
+output stream. A new Thrift structure, with column crypto metadata, is added to the
+column chunks of the encrypted columns. This metadata provides information about the
+column encryption keys.
+
+The results of Thrift serialization of metadata structures are encrypted before being
+written to the output stream.
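The module-level flow described above (compress, then encrypt each module under its column's key) can be sketched as follows. This is an illustrative sketch only, assuming the third-party `cryptography` package and a simple per-column key map; real writers record which key encrypted each column via the Thrift crypto metadata.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_module(key: bytes, module_bytes: bytes) -> bytes:
    """Encrypt one Parquet module (e.g. a compressed page or a serialized
    Thrift structure) with AES-GCM; a fresh 12-byte nonce is prepended."""
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, module_bytes, None)

def decrypt_module(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

# Different columns may be protected with different keys (column access control).
column_keys = {"ssn": AESGCM.generate_key(bit_length=128),
               "name": AESGCM.generate_key(bit_length=128)}

compressed_page = b"<compressed page bytes>"
blob = encrypt_module(column_keys["ssn"], compressed_page)
assert decrypt_module(column_keys["ssn"], blob) == compressed_page
```

A reader holding only the `name` key would fail to decrypt the `ssn` modules, which is what gives the column-level access control.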
+
+## Encryption algorithms
+
+Parquet encryption algorithms are based on the standard AES ciphers for symmetric
+encryption. AES is supported in Intel and other CPUs with hardware acceleration of
+crypto operations (“AES-NI”), which can be leveraged by e.g. Java programs
+(automatically via HotSpot) or C++ programs (via EVP-* functions in OpenSSL).
+
+Initially, two algorithms are implemented: one based on the GCM mode of AES, and the
+other on a combination of the GCM and CTR modes.
+
+AES-GCM is an authenticated encryption cipher. Besides data confidentiality
+(encryption), it supports two levels of integrity verification / authentication: of
+the data (default), and of the data combined with an optional AAD (“additional
+authenticated data”). The default authentication verifies that the data has not been
+tampered with. An AAD is free text to be signed together with the data. The user can,
+for example, pass the file name with its version (or creation timestamp) as the AAD,
+to verify that the file has not been replaced with an older version.
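The AAD check can be exercised like this (a sketch assuming the third-party `cryptography` package; key management and AAD retrieval metadata are out of scope here):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.exceptions import InvalidTag

key = AESGCM.generate_key(bit_length=128)
nonce = os.urandom(12)
aad = b"part-0001.parquet/v2"  # e.g. file name plus version, chosen by the user

ciphertext = AESGCM(key).encrypt(nonce, b"page data", aad)

# Decryption succeeds only when the reader supplies the same AAD ...
assert AESGCM(key).decrypt(nonce, ciphertext, aad) == b"page data"

# ... and fails if the file was swapped for a different version.
try:
    AESGCM(key).decrypt(nonce, ciphertext, b"part-0001.parquet/v1")
    replaced_undetected = True
except InvalidTag:
    replaced_undetected = False
assert not replaced_undetected
```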
+
+Sometimes, hardware acceleration of AES is unavailable (e.g. in Java 8). AES crypto
+operations are then implemented in software and can be somewhat slow, becoming a
+performance bottleneck in certain workloads. AES-CTR is a regular (not authenticated)
+cipher. It is faster than AES-GCM, since it doesn’t perform integrity verification.

[jira] [Updated] (PARQUET-1419) Enable old readers to access unencrypted columns in files with plaintext footer

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1419:

Labels: pull-request-available  (was: )

> Enable old readers to access unencrypted columns in files with plaintext 
> footer
> ---
>
> Key: PARQUET-1419
> URL: https://issues.apache.org/jira/browse/PARQUET-1419
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format, parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1419) Enable old readers to access unencrypted columns in files with plaintext footer

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631344#comment-16631344
 ] 

ASF GitHub Bot commented on PARQUET-1419:
-

ggershinsky closed pull request #106: PARQUET-1419: enable old readers to 
access unencrypted columns in files with plaint…
URL: https://github.com/apache/parquet-format/pull/106
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index c05e871b..6e16f2d7 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -860,6 +860,32 @@ struct ColumnIndex {
   5: optional list<i64> null_counts
 }
 
+struct AesGcmV1 {
+  /** Retrieval metadata of AAD used for encryption of pages and structures **/
+  1: optional binary aad_metadata
+
+  /** If file IVs are comprised of a fixed part, and variable parts
+   *  (e.g. counter), keep the fixed part here **/
+  2: optional binary iv_prefix
+ 
+}
+
+struct AesGcmCtrV1 {
+  /** Retrieval metadata of AAD used for encryption of structures **/
+  1: optional binary aad_metadata
+
+  /** If file IVs are comprised of a fixed part, and variable parts
+   *  (e.g. counter), keep the fixed part here **/
+  2: optional binary gcm_iv_prefix
+
+  3: optional binary ctr_iv_prefix
+}
+
+union EncryptionAlgorithm {
+  1: AesGcmV1 AES_GCM_V1
+  2: AesGcmCtrV1 AES_GCM_CTR_V1
+}
+
 /**
  * Description for file metadata
  */
@@ -902,46 +928,20 @@ struct FileMetaData {
* The obsolete min and max fields are always sorted by signed comparison
* regardless of column_orders.
*/
-  7: optional list<ColumnOrder> column_orders;
-}
-
-struct AesGcmV1 {
-  /** Retrieval metadata of AAD used for encryption of pages and structures **/
-  1: optional binary aad_metadata
-
-  /** If file IVs are comprised of a fixed part, and variable parts
-   *  (e.g. counter), keep the fixed part here **/
-  2: optional binary iv_prefix
- 
-}
-
-struct AesGcmCtrV1 {
-  /** Retrieval metadata of AAD used for encryption of structures **/
-  1: optional binary aad_metadata
-
-  /** If file IVs are comprised of a fixed part, and variable parts
-   *  (e.g. counter), keep the fixed part here **/
-  2: optional binary gcm_iv_prefix
-
-  3: optional binary ctr_iv_prefix
-}
-
-union EncryptionAlgorithm {
-  1: AesGcmV1 AES_GCM_V1
-  2: AesGcmCtrV1 AES_GCM_CTR_V1
+  7: optional list<ColumnOrder> column_orders
+  
+  /** Set in encrypted files with plaintext footer **/
+  8: optional EncryptionAlgorithm encryption_algorithm
 }
 
 struct FileCryptoMetaData {
   1: required EncryptionAlgorithm encryption_algorithm
-  
-  /** Parquet footer can be encrypted, or left as plaintext **/
-  2: required bool encrypted_footer
 
   /** Retrieval metadata of key used for encryption of footer, 
*  and (possibly) columns **/
-  3: optional binary footer_key_metadata
+  2: optional binary footer_key_metadata
 
-  /** Offset of Parquet footer (encrypted, or plaintext) **/
-  4: required i64 footer_offset
+  /** Offset of encrypted Parquet footer **/
+  3: required i64 footer_offset
 }
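To illustrate the effect of this PARQUET-1419 change: an encrypted-footer file carries a `FileCryptoMetaData` whose `encryption_algorithm` is required, while a plaintext-footer file keeps a regular `FileMetaData` and sets its new optional `encryption_algorithm` field, so old readers can still parse the unencrypted columns. A minimal Python sketch of the resulting reader-side dispatch (hypothetical helper, not an API from parquet-format; dicts stand in for the deserialized Thrift structures):

```python
def footer_mode(file_crypto_meta, file_meta):
    """Classify a file per the two metadata-protection modes.
    Only the fields shown here are assumed."""
    if file_crypto_meta is not None:
        # Encrypted footer: encryption_algorithm is required in FileCryptoMetaData.
        return "encrypted-footer"
    if file_meta.get("encryption_algorithm") is not None:
        # Plaintext footer of a file in which some columns are encrypted.
        return "plaintext-footer-with-encryption"
    return "plaintext"

assert footer_mode({"encryption_algorithm": "AES_GCM_V1"}, None) == "encrypted-footer"
assert footer_mode(None, {"encryption_algorithm": "AES_GCM_V1"}) == "plaintext-footer-with-encryption"
assert footer_mode(None, {}) == "plaintext"
```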
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enable old readers to access unencrypted columns in files with plaintext 
> footer
> ---
>
> Key: PARQUET-1419
> URL: https://issues.apache.org/jira/browse/PARQUET-1419
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format, parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (PARQUET-1425) [Format] Fix Thrift compiler warning

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1425:
---
Fix Version/s: (was: format-2.6.0)

> [Format] Fix Thrift compiler warning
> 
>
> Key: PARQUET-1425
> URL: https://issues.apache.org/jira/browse/PARQUET-1425
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Wes McKinney
>Priority: Major
>
> I see this warning frequently
> {code}
> [1/127] Running thrift compiler on parquet.thrift
> [WARNING:/home/wesm/code/arrow/cpp/src/parquet/parquet.thrift:295] The "byte" 
> type is a compatibility alias for "i8". Use "i8" to emphasize the signedness 
> of this type.
> {code}





[jira] [Created] (PARQUET-1428) Move columnar encryption into its feature branch

2018-09-27 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1428:
--

 Summary: Move columnar encryption into its feature branch
 Key: PARQUET-1428
 URL: https://issues.apache.org/jira/browse/PARQUET-1428
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar








[jira] [Commented] (PARQUET-1428) Move columnar encryption into its feature branch

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630226#comment-16630226
 ] 

ASF GitHub Bot commented on PARQUET-1428:
-

zivanfi closed pull request #107: PARQUET-1428: Move columnar encryption into 
its feature branch 
URL: https://github.com/apache/parquet-format/pull/107
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index c05e871b..6c9011b9 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -662,22 +662,6 @@ struct ColumnMetaData {
   13: optional list<PageEncodingStats> encoding_stats;
 }
 
-struct EncryptionWithFooterKey {
-}
-
-struct EncryptionWithColumnKey {
-  /** Column path in schema **/
-  1: required list<string> path_in_schema
-  
-  /** Retrieval metadata of the column-specific key **/
-  2: optional binary column_key_metadata
-}
-
-union ColumnCryptoMetaData {
-  1: EncryptionWithFooterKey ENCRYPTION_WITH_FOOTER_KEY
-  2: EncryptionWithColumnKey ENCRYPTION_WITH_COLUMN_KEY
-}
-
 struct ColumnChunk {
   /** File where column data is stored.  If not set, assumed to be same file as
 * metadata.  This path is relative to the current file.
@@ -704,9 +688,6 @@ struct ColumnChunk {
 
   /** Size of ColumnChunk's ColumnIndex, in bytes **/
   7: optional i32 column_index_length
-  
-  /** Crypto metadata of encrypted columns **/
-  8: optional ColumnCryptoMetaData crypto_meta_data
 }
 
 struct RowGroup {
@@ -725,13 +706,6 @@ struct RowGroup {
* The sorting columns can be a subset of all the columns.
*/
   4: optional list<SortingColumn> sorting_columns
-
-  /** Byte offset from beginning of file to first page (data or dictionary)
-   * in this row group **/
-  5: optional i64 file_offset
-
-  /** Total byte size of all compressed column data in this row group **/
-  6: optional i64 total_compressed_size
 }
 
 /** Empty struct to signal the order defined by the physical or logical type */
@@ -905,43 +879,3 @@ struct FileMetaData {
   7: optional list<ColumnOrder> column_orders;
 }
 
-struct AesGcmV1 {
-  /** Retrieval metadata of AAD used for encryption of pages and structures **/
-  1: optional binary aad_metadata
-
-  /** If file IVs are comprised of a fixed part, and variable parts
-   *  (e.g. counter), keep the fixed part here **/
-  2: optional binary iv_prefix
- 
-}
-
-struct AesGcmCtrV1 {
-  /** Retrieval metadata of AAD used for encryption of structures **/
-  1: optional binary aad_metadata
-
-  /** If file IVs are comprised of a fixed part, and variable parts
-   *  (e.g. counter), keep the fixed part here **/
-  2: optional binary gcm_iv_prefix
-
-  3: optional binary ctr_iv_prefix
-}
-
-union EncryptionAlgorithm {
-  1: AesGcmV1 AES_GCM_V1
-  2: AesGcmCtrV1 AES_GCM_CTR_V1
-}
-
-struct FileCryptoMetaData {
-  1: required EncryptionAlgorithm encryption_algorithm
-  
-  /** Parquet footer can be encrypted, or left as plaintext **/
-  2: required bool encrypted_footer
-
-  /** Retrieval metadata of key used for encryption of footer, 
-   *  and (possibly) columns **/
-  3: optional binary footer_key_metadata
-
-  /** Offset of Parquet footer (encrypted, or plaintext) **/
-  4: required i64 footer_offset
-}
-


 




> Move columnar encryption into its feature branch
> 
>
> Key: PARQUET-1428
> URL: https://issues.apache.org/jira/browse/PARQUET-1428
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (PARQUET-1428) Move columnar encryption into its feature branch

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1428:

Labels: pull-request-available  (was: )

> Move columnar encryption into its feature branch
> 
>
> Key: PARQUET-1428
> URL: https://issues.apache.org/jira/browse/PARQUET-1428
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Resolved] (PARQUET-1427) [C++] Move example executables and CLI tools to Apache Arrow repo

2018-09-27 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1427.
---
Resolution: Fixed

Merged in 
https://github.com/apache/arrow/commit/723a437802143fd00d97c101caaf6deabea3f8c6

> [C++] Move example executables and CLI tools to Apache Arrow repo
> -
>
> Key: PARQUET-1427
> URL: https://issues.apache.org/jira/browse/PARQUET-1427
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (PARQUET-1227) Thrift crypto metadata structures

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1227:
---
Fix Version/s: (was: format-2.6.0)

> Thrift crypto metadata structures
> -
>
> Key: PARQUET-1227
> URL: https://issues.apache.org/jira/browse/PARQUET-1227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> New Thrift structures for Parquet modular encryption





[jira] [Updated] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1398:
---
Fix Version/s: (was: format-2.6.0)

> Separate iv_prefix for GCM and CTR modes
> 
>
> Key: PARQUET-1398
> URL: https://issues.apache.org/jira/browse/PARQUET-1398
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is an ambiguity in what the iv_prefix applies to - GCM, CTR, or both. 
> This parameter will be moved to the Algorithms structures (from the 
> FileCryptoMetaData structure).
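The "fixed part + variable parts (e.g. counter)" IV layout described in the iv_prefix field comments can be sketched as follows (illustrative sizes and split only; this issue does not fix the exact layout):

```python
import struct

def make_iv(iv_prefix: bytes, counter: int) -> bytes:
    """Compose a 12-byte AES-GCM IV from a fixed per-file prefix and a
    variable per-module counter (illustrative 4 + 8 byte split)."""
    assert len(iv_prefix) == 4
    return iv_prefix + struct.pack(">Q", counter)

prefix = b"\x01\x02\x03\x04"
ivs = {make_iv(prefix, i) for i in range(10_000)}
assert len(ivs) == 10_000  # IVs never repeat under a single key
```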





[jira] [Updated] (PARQUET-1424) Release parquet-format 2.6.0

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1424:
---
Fix Version/s: (was: format-2.6.0)

> Release parquet-format 2.6.0
> 
>
> Key: PARQUET-1424
> URL: https://issues.apache.org/jira/browse/PARQUET-1424
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Priority: Major
>
> Release parquet-format 2.6.0
> The release requires reverting of the merged PRs related to columnar 
> encryption, since there's no signed spec yet. Those PRs should be developed 
> on a feature branch instead and merge to master once the spec is signed and 
> the format changes are ready to get released.





[jira] [Updated] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1398:
---
Fix Version/s: format-encryption-feature-branch

> Separate iv_prefix for GCM and CTR modes
> 
>
> Key: PARQUET-1398
> URL: https://issues.apache.org/jira/browse/PARQUET-1398
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>  Labels: pull-request-available
> Fix For: format-encryption-feature-branch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is an ambiguity in what the iv_prefix applies to - GCM, CTR, or both. 
> This parameter will be moved to the Algorithms structures (from the 
> FileCryptoMetaData structure).





[jira] [Updated] (PARQUET-1227) Thrift crypto metadata structures

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1227:
---
Fix Version/s: format-encryption-feature-branch

> Thrift crypto metadata structures
> -
>
> Key: PARQUET-1227
> URL: https://issues.apache.org/jira/browse/PARQUET-1227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0, format-encryption-feature-branch
>
>
> New Thrift structures for Parquet modular encryption





Re: Exception while writing a Parquet File in a secured Cluster

2018-09-27 Thread Deepak Majeti
Hi Srinivas,

We have been using swebhdfs in Vertica to support secured (Kerberos + SSL)
Hadoop clusters without any issues.
I don't think this is a Parquet issue.
One alternative is to use command-line curl with verbose logging to see if
something shows up.

On Tue, Sep 25, 2018 at 11:06 PM Srinivas M  wrote:

> Hi Ryan, Thanks a lot for taking time to respond to my email and providing
> your perspective. Yes, the FileSystem could be accessed from outside
> through the hadoop and as well as Hive. But, when we are trying to access
> it through webhdfs (and swebhdfs as well), we are running into issues.
>
> While using the protocol as webhdfs, we were seeing the following error.
>
> Caused by: java.net.SocketException: bda6node02.infoftps.com:14000:
> Unexpected end of file from server
>at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:86)
>at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:58)
>at
> java.lang.reflect.Constructor.newInstance(Constructor.java:542)
>at
>
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:691)
>at
>
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:519)
>at
>
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:549)
>at
> java.security.AccessController.doPrivileged(AccessController.java:488)
>at javax.security.auth.Subject.doAs(Subject.java:572)
>
> After investigating into the issue, it has been identified that the issue
> was due to the mismatch in the protocol (server is configured with SSL) and
> the client application was doing a plain request and hence the error. So, I
> had switched the protocol to swebhdfs and then we started seeing this new
> error (which was mentioned in the earlier mail).
>
> Are there any additional debugging that could be enabled to understand what
> could be failing ? I could not make out much from the Kerberos and SSL
> Debug logs.
>
> On a side note, is the swebhdfs implementation fully stable and is it
> expected to work for Parquet, when accessing files over secure HDFS ?
>
> Thanks Once again for taking time to respond to my questions.
>
> On Mon, Sep 24, 2018 at 9:55 PM Ryan Blue 
> wrote:
>
> > This is probably related to the fact that your FS is getting created
> inside
> > a call to Parquet (org.apache.hadoop.fs.FileSystem.create). Can you
> access
> > that target file system first to make sure it is set up properly?
> >
> > It could be that Parquet isn't handling Configuration correctly in this
> > stack.
> >
> > rb
> >
> > On Mon, Sep 24, 2018 at 8:30 AM Srinivas M  wrote:
> >
> > >  Hi
> > >
> > > We have an application that writes parquet files. I am using the
> > > AvroParquetWriter to write parquet files. While this piece of code
> works
> > > fine in a Kerberos environment, it is failing when SSL is enabled in
> the
> > > Hadoop cluster. So, I had modified the code to use the swebhdfs
> protocol
> > > instead of the webhdfs and it is still failing with the following
> > > exception.
> > >
> > >{
> > >  conf.set("hadoop.security.authentication", "kerberos");
> > >  UserGroupInformation.setConfiguration(conf);
> > >  ugi =
> > > UserGroupInformation.loginUserFromKeytabAndReturnUGI(_user,_keytab) ;
> > >
> > >  try
> > >  {
> > > ugi.doAs(new PrivilegedExceptionAction<Object>()
> > > {
> > >   public Object run() throws IOException
> > >   {
> > >  fs = FileSystem.get(hdfsuri,conf) ;
> > >
> > >   if (_fileExistsAction ==
> VAL_FILEEXISTS_OVERWRITE)
> > >  fs.delete(new Path(_fileName),false) ;
> > >
> > >  _writer = new AvroParquetWriter(new Path(hdfsuri),
> > > _schema, _ParquetCompressionCodec, _ParquetBlockSize,
> _ParquetPageSize);
> > > return null ;
> > >   }
> > > }
> > >   );
> > >
> > > hdfsuri in this case is of the format "swebhdfs://"+_host+":"+_port +
> > "/" +
> > > _fileName"
> > >
> > > *The application is failing with the following exception :*
> > > *===*
> > > Error :
> > > Caused by
> > > org.apache.hadoop.ipc.RemoteException(javax.ws.rs
> > > .WebApplicationException):
> > > null
> > > at
> > >
> org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:124)
> > > at
> > >
> > >
> >
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:420)
> > > at
> > >
> > >
> >
> 

[jira] [Created] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1429:
--

 Summary: Turn off DocLint on parquet-format
 Key: PARQUET-1429
 URL: https://issues.apache.org/jira/browse/PARQUET-1429
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar
Assignee: Nandor Kollar
 Fix For: format-2.6.0


DocLint was introduced in Java 8, and since the generated code in parquet-format 
has several issues found by DocLint, the attach-javadocs goal will fail.





[jira] [Commented] (PARQUET-1426) [C++] parquet-dump-schema has poor usability

2018-09-27 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630484#comment-16630484
 ] 

Deepak Majeti commented on PARQUET-1426:


We should add tests for all the tools as well. I will open a Jira for that.

> [C++] parquet-dump-schema has poor usability
> 
>
> Key: PARQUET-1426
> URL: https://issues.apache.org/jira/browse/PARQUET-1426
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> {code}
> $ ./debug/parquet-dump-schema
> terminate called after throwing an instance of 'std::logic_error'
>   what():  basic_string::_S_construct null not valid
> Aborted (core dumped)
> $ ./debug/parquet-dump-schema --help
> Parquet error: Arrow error: IOError: ../src/arrow/io/file.cc:508 code: 
> result->memory_map_->Open(path, mode)
> ../src/arrow/io/file.cc:380 code: file_->OpenReadable(path)
> ../src/arrow/io/file.cc:99 code: internal::FileOpenReadable(file_name_, _)
> Failed to open local file: --help , error: No such file or directory
> {code}





[jira] [Commented] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630483#comment-16630483
 ] 

ASF GitHub Bot commented on PARQUET-1429:
-

nandorKollar opened a new pull request #108: PARQUET-1429: Turn off DocLint on 
parquet-format
URL: https://github.com/apache/parquet-format/pull/108
 
 
   




> Turn off DocLint on parquet-format
> --
>
> Key: PARQUET-1429
> URL: https://issues.apache.org/jira/browse/PARQUET-1429
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>
> DocLint is introduced in Java 8, and since the generated code in 
> parquet-format has several issues found by DocLint, attach-javadocs goal will 
> fail.





[jira] [Updated] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1429:

Labels: pull-request-available  (was: )

> Turn off DocLint on parquet-format
> --
>
> Key: PARQUET-1429
> URL: https://issues.apache.org/jira/browse/PARQUET-1429
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>
> DocLint is introduced in Java 8, and since the generated code in 
> parquet-format has several issues found by DocLint, attach-javadocs goal will 
> fail.





[jira] [Created] (PARQUET-1430) [C++] Add tests for C++ tools

2018-09-27 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1430:
--

 Summary: [C++] Add tests for C++ tools
 Key: PARQUET-1430
 URL: https://issues.apache.org/jira/browse/PARQUET-1430
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: 1.6.1


We currently do not have any tests for the tools.





[jira] [Commented] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630512#comment-16630512
 ] 

ASF GitHub Bot commented on PARQUET-1429:
-

zivanfi closed pull request #108: PARQUET-1429: Turn off DocLint on 
parquet-format
URL: https://github.com/apache/parquet-format/pull/108
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/pom.xml b/pom.xml
index 0b0c1141..5d35eccc 100644
--- a/pom.xml
+++ b/pom.xml
@@ -121,6 +121,13 @@
   
 
   
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-javadoc-plugin</artifactId>
+        <configuration>
+          <additionalparam>-Xdoclint:none</additionalparam>
+        </configuration>
+      </plugin>
   
 
 org.apache.maven.plugins


 




> Turn off DocLint on parquet-format
> --
>
> Key: PARQUET-1429
> URL: https://issues.apache.org/jira/browse/PARQUET-1429
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>
> DocLint was introduced in Java 8, and since the generated code in 
> parquet-format has several issues found by DocLint, attach-javadocs goal will 
> fail.





[jira] [Resolved] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1429.

Resolution: Fixed

> Turn off DocLint on parquet-format
> --
>
> Key: PARQUET-1429
> URL: https://issues.apache.org/jira/browse/PARQUET-1429
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>
> DocLint was introduced in Java 8, and since the generated code in 
> parquet-format has several issues found by DocLint, attach-javadocs goal will 
> fail.





[VOTE] Release Apache Parquet format 2.6.0 RC0

2018-09-27 Thread Nandor Kollar
Hi everyone,

I propose the following RC to be released as official Apache Parquet
Format 2.6.0 release.

The commit id is df6132b94f273521a418a74442085fdd5a0aa009
* This corresponds to the tag: apache-parquet-format-2.6.0
* 
https://github.com/apache/parquet-format/tree/df6132b94f273521a418a74442085fdd5a0aa009
* 
https://gitbox.apache.org/repos/asf?p=parquet-format.git;a=commit;h=df6132b94f273521a418a74442085fdd5a0aa009

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.6.0-rc0

You can find the KEYS file here:
* https://dist.apache.org/repos/dist/dev/parquet/KEYS

Binary artifacts are staged in Nexus here:
* 
https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.6.0

This release includes the following changes:

PARQUET-1266 - LogicalTypes union in parquet-format doesn't include UUID
PARQUET-1290 - Clarify maximum run lengths for RLE encoding
PARQUET-1387 - Nanosecond precision time and timestamp - parquet-format
PARQUET-1400 - Deprecate parquet-mr related code in parquet-format
PARQUET-1429 - Turn off DocLint on parquet-format

Please download, verify, and test.

The voting will be open for at least 72 hours from now.

[ ] +1 Release this as Apache Parquet Format 2.6.0
[ ] +0
[ ] -1 Do not release this because...

Thanks,
Nandor


Re: [VOTE] Release Apache Parquet format 2.6.0 RC0

2018-09-27 Thread Zoltan Ivanfi
+1 (binding)

- contents look good
- unit tests pass
- checksums match
- signature matches

Thanks,

Zoltan

On Thu, Sep 27, 2018 at 5:02 PM Nandor Kollar wrote:

> Hi everyone,
>
> I propose the following RC to be released as official Apache Parquet
> Format 2.6.0 release.
>
> The commit id is df6132b94f273521a418a74442085fdd5a0aa009
> * This corresponds to the tag: apache-parquet-format-2.6.0
> *
> https://github.com/apache/parquet-format/tree/df6132b94f273521a418a74442085fdd5a0aa009
> *
> https://gitbox.apache.org/repos/asf?p=parquet-format.git;a=commit;h=df6132b94f273521a418a74442085fdd5a0aa009
>
> The release tarball, signature, and checksums are here:
> *
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.6.0-rc0
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/parquet/KEYS
>
> Binary artifacts are staged in Nexus here:
> *
> https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.6.0
>
> This release includes the following changes:
>
> PARQUET-1266 - LogicalTypes union in parquet-format doesn't include UUID
> PARQUET-1290 - Clarify maximum run lengths for RLE encoding
> PARQUET-1387 - Nanosecond precision time and timestamp - parquet-format
> PARQUET-1400 - Deprecate parquet-mr related code in parquet-format
> PARQUET-1429 - Turn off DocLint on parquet-format
>
> Please download, verify, and test.
>
> The voting will be open for at least 72 hours from now.
>
> [ ] +1 Release this as Apache Parquet Format 2.6.0
> [ ] +0
> [ ] -1 Do not release this because...
>
> Thanks,
> Nandor
>


Re: parquet sync notes

2018-09-27 Thread Zoltan Ivanfi
Hi,

I have created the feature branches:

- https://github.com/apache/parquet-mr/tree/bloom-filter
- https://github.com/apache/parquet-format/tree/bloom-filter

- https://github.com/apache/parquet-mr/tree/encryption
- https://github.com/apache/parquet-format/tree/encryption

I have also cherry-picked the encryption commits to the latter one.

Br,

Zoltan

On Wed, Sep 26, 2018 at 10:29 AM 俊杰陈  wrote:

> Hi Zoltan
>
> PR #62 contains some rebase info which is not related to the change itself, so I
> created PR #99. Currently it only contains one file change; I will add
> another document file later.
>
> Zoltan Ivanfi wrote on Wednesday, September 26, 2018 at 3:19 PM:
>
> > Hi,
> >
> > It seems to me that PR #99 does not supersede PR #62, as the latter
> > affects 16 files but the former only modifies a single one. Or has the
> > rest of the changes been already merged to the codebase from another PR?
> > I checked the history and I don't see anything related.
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Wed, Sep 26, 2018 at 4:25 AM 俊杰陈  wrote:
> >
> > > Hi
> > >
> > > PR #28 and PR #62 of parquet-format were closed. Will we create a
> > > feature branch for bloom filter on parquet-mr as well?
> > >
> > > Julien Le Dem wrote on Wednesday, September 26, 2018 at 12:48 AM:
> > >
> > > > Lars (Cloudera Impala): listen in.
> > > > Zoltan, Gabor and Nandor (Cloudera):
> > > >
> > > >- feature branch reviewed and merged
> > > >- Parquet-format release
> > > >-
> > > >   - Define scope
> > > >
> > > > Ryan (Netflix)
> > > > Junjie (tencent): bloom filter
> > > > Jim Apple (cloud service): bloom filter in parquet-mr? Since they got
> > > > in parquet-cpp
> > > > Gidon (IBM): encryption
> > > > Sahil (Cloudera impala, hive): listen in
> > > > Julien (Wework)
> > > >
> > > > Status update from Gabor:
> > > >
> > > >-  Waiting for reviews.
> > > >   - Plan to merge this Friday.
> > > >   - Please review in the next few days.
> > > >
> > > > Parquet format release:
> > > >
> > > >- Nanosecond precision
> > > >- Deprecation of java related code
> > > >- Encryption metadata
> > > >   - One more pr to merge
> > > >   - Plan:
> > > >   - Revert the encryption patches and put them in a feature branch
> > > >     in parquet-format
> > > >   - Apply the same process to bloom filters
> > > >   - Owner of pr can update it to the feature branch
> > > >
> > > >
> > > > Encryption:
> > > >
> > > >- Old readers can read non encrypted columns
> > > >   - Changes to metadata
> > > >   - One last PR on parquet-format
> > > >   - We should have a vote before merging it.
> > > >- Make sure parquet-cpp depends on the source of truth thrift in
> > > >parquet-format.
> > > >
> > > >
> > > > Bloom filter:
> > > >
> > > >- parquet-format/62 and parquet-format/99
> > > >- parquet-format/28: should be closed as it is outdated. We should
> > > >  port the doc to the more recent PR.
> > > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>
>
> --
> Thanks & Best Regards
>


[jira] [Updated] (PARQUET-1430) [C++] Add tests for C++ tools

2018-09-27 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1430:
---
Fix Version/s: (was: 1.6.1)
   cpp-1.6.0

> [C++] Add tests for C++ tools
> -
>
> Key: PARQUET-1430
> URL: https://issues.apache.org/jira/browse/PARQUET-1430
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> We currently do not have any tests for the tools.





[jira] [Created] (PARQUET-1431) [C++] Automatically set thrift to use boost for thrift versions before 0.11

2018-09-27 Thread Deepak Majeti (JIRA)
Deepak Majeti created PARQUET-1431:
--

 Summary: [C++] Automatically set thrift to use boost for thrift 
versions before 0.11
 Key: PARQUET-1431
 URL: https://issues.apache.org/jira/browse/PARQUET-1431
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Deepak Majeti
Assignee: Deepak Majeti
 Fix For: cpp-1.6.0


PARQUET_THRIFT_USE_BOOST is currently a manual CMake option. Instead, parquet should 
set the PARQUET_THRIFT_USE_BOOST definition automatically based on the detected 
Thrift version.
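For illustration, a minimal sketch of the version check this issue asks for (hypothetical Python, not the actual CMake logic): Thrift releases before 0.11 generated code that relied on boost::shared_ptr, so the define should be enabled automatically for them.

```python
def thrift_needs_boost(thrift_version: str) -> bool:
    """Return True when PARQUET_THRIFT_USE_BOOST should be defined,
    i.e. when the detected Thrift version predates 0.11 (whose
    generated code still relied on boost::shared_ptr)."""
    major, minor = (int(p) for p in thrift_version.split(".")[:2])
    return (major, minor) < (0, 11)

# The decision the build system would make automatically:
print(thrift_needs_boost("0.10.0"))  # True  -> add -DPARQUET_THRIFT_USE_BOOST
print(thrift_needs_boost("0.11.0"))  # False -> generated code uses std::shared_ptr
```

The function name and signature are illustrative; the real fix would live in the project's CMake Thrift detection.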





[jira] [Updated] (PARQUET-1431) [C++] Automatically set thrift to use boost for thrift versions before 0.11

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1431:

Labels: pull-request-available  (was: )

> [C++] Automatically set thrift to use boost for thrift versions before 0.11
> --
>
> Key: PARQUET-1431
> URL: https://issues.apache.org/jira/browse/PARQUET-1431
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> PARQUET_THRIFT_USE_BOOST is currently a manual CMake option. Instead, parquet should 
> set the PARQUET_THRIFT_USE_BOOST definition automatically based on the detected 
> Thrift version.





[jira] [Commented] (PARQUET-1160) [C++] Implement BYTE_ARRAY-backed Decimal reads

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631112#comment-16631112
 ] 

ASF GitHub Bot commented on PARQUET-1160:
-

wesm closed pull request #495: PARQUET-1160: [C++] Implement BYTE_ARRAY-backed 
Decimal reads
URL: https://github.com/apache/parquet-cpp/pull/495
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/data/byte_array_decimal.parquet b/data/byte_array_decimal.parquet
new file mode 100644
index ..798cb2aa
Binary files /dev/null and b/data/byte_array_decimal.parquet differ
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
index 5f4e1234..30dbf4ad 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -2316,11 +2316,11 @@ TEST(TestArrowReaderAdHoc, Int96BadMemoryAccess) {
   ASSERT_OK_NO_THROW(arrow_reader->ReadTable(&table));
 }
 
-class TestArrowReaderAdHocSpark
+class TestArrowReaderAdHocSparkAndHvr
 : public ::testing::TestWithParam<
   std::tuple<std::string, std::shared_ptr<::arrow::DataType>>> {};
 
-TEST_P(TestArrowReaderAdHocSpark, ReadDecimals) {
+TEST_P(TestArrowReaderAdHocSparkAndHvr, ReadDecimals) {
   std::string path(test::get_data_dir());
 
   std::string filename;
@@ -2364,12 +2364,13 @@ TEST_P(TestArrowReaderAdHocSpark, ReadDecimals) {
 }
 
 INSTANTIATE_TEST_CASE_P(
-ReadDecimals, TestArrowReaderAdHocSpark,
+ReadDecimals, TestArrowReaderAdHocSparkAndHvr,
 ::testing::Values(
 std::make_tuple("int32_decimal.parquet", ::arrow::decimal(4, 2)),
 std::make_tuple("int64_decimal.parquet", ::arrow::decimal(10, 2)),
 std::make_tuple("fixed_length_decimal.parquet", ::arrow::decimal(25, 2)),
-std::make_tuple("fixed_length_decimal_legacy.parquet", ::arrow::decimal(13, 2))));
+std::make_tuple("fixed_length_decimal_legacy.parquet", ::arrow::decimal(13, 2)),
+std::make_tuple("byte_array_decimal.parquet", ::arrow::decimal(4, 2))));
 
 }  // namespace arrow
 
diff --git a/src/parquet/arrow/reader.cc b/src/parquet/arrow/reader.cc
index 2e4dc815..de7261ce 100644
--- a/src/parquet/arrow/reader.cc
+++ b/src/parquet/arrow/reader.cc
@@ -1220,6 +1220,66 @@ struct TransferFunctor<::arrow::Decimal128Type, FLBAType> {
   }
 };
 
+/// \brief Convert an arrow::BinaryArray to an arrow::Decimal128Array
+/// We do this by:
+/// 1. Creating an arrow::BinaryArray from the RecordReader's builder
+/// 2. Allocating a buffer for the arrow::Decimal128Array
+/// 3. Converting the big-endian bytes in each BinaryArray entry to two integers
+///    representing the high and low bits of each decimal value.
+template <>
+struct TransferFunctor<::arrow::Decimal128Type, ByteArrayType> {
+  Status operator()(RecordReader* reader, MemoryPool* pool,
+                    const std::shared_ptr<::arrow::DataType>& type,
+                    std::shared_ptr<Array>* out) {
+    DCHECK_EQ(type->id(), ::arrow::Type::DECIMAL);
+
+    // Finish the built data into a temporary array
+    std::shared_ptr<Array> array;
+    RETURN_NOT_OK(reader->builder()->Finish(&array));
+    const auto& binary_array = static_cast<const ::arrow::BinaryArray&>(*array);
+
+    const int64_t length = binary_array.length();
+
+    const auto& decimal_type = static_cast<const ::arrow::Decimal128Type&>(*type);
+    const int64_t type_length = decimal_type.byte_width();
+
+    std::shared_ptr<Buffer> data;
+    RETURN_NOT_OK(::arrow::AllocateBuffer(pool, length * type_length, &data));
+
+    // raw bytes that we can write to
+    uint8_t* out_ptr = data->mutable_data();
+
+    const int64_t null_count = binary_array.null_count();
+
+    // convert each BinaryArray value to valid decimal bytes
+    for (int64_t i = 0; i < length; i++, out_ptr += type_length) {
+      int32_t record_len = 0;
+      const uint8_t* record_loc = binary_array.GetValue(i, &record_len);
+
+      if ((record_len < 0) || (record_len > type_length)) {
+        return Status::Invalid("Invalid BYTE_ARRAY size");
+      }
+
+      auto out_ptr_view = reinterpret_cast<uint64_t*>(out_ptr);
+      out_ptr_view[0] = 0;
+      out_ptr_view[1] = 0;
+
+      // only convert rows that are not null if there are nulls, or
+      // all rows, if there are not
+      if (((null_count > 0) && !binary_array.IsNull(i)) || (null_count <= 0)) {
+        RawBytesToDecimalBytes(record_loc, record_len, out_ptr);
+      }
+    }
+
+    *out = std::make_shared<::arrow::Decimal128Array>(
+        type, length, data, binary_array.null_bitmap(), null_count);
+
+    return Status::OK();
+  }
+};
+
+
 /// \brief Convert an Int32 or Int64 array into a Decimal128Array
 /// The parquet spec allows systems to write decimals in int32, int64 if the 
values are
 /// small enough to fit in less 4 bytes or less than 8 
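The TransferFunctor in the diff above sign-extends each big-endian BYTE_ARRAY value into a 128-bit decimal split into high and low words. A standalone Python sketch of that conversion (hypothetical, mirroring the steps in the doc comment rather than the parquet-cpp helper RawBytesToDecimalBytes):

```python
def byte_array_to_decimal_words(raw: bytes) -> tuple:
    """Interpret big-endian two's-complement bytes as a 128-bit decimal
    unscaled value and return its (high, low) unsigned 64-bit words."""
    value = int.from_bytes(raw, byteorder="big", signed=True)
    unsigned = value & ((1 << 128) - 1)  # two's complement at 128 bits
    return unsigned >> 64, unsigned & ((1 << 64) - 1)

# A two-byte value 0x0100 is the unscaled integer 256:
print(byte_array_to_decimal_words(b"\x01\x00"))  # (0, 256)
# A single 0xFF byte is -1, which sign-extends across both words:
print(byte_array_to_decimal_words(b"\xff"))      # (18446744073709551615, 18446744073709551615)
```

The variable-length encoding is why the C++ code must validate `record_len` against the decimal's byte width before converting each entry.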

[jira] [Commented] (PARQUET-1201) Column indexes

2018-09-27 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631130#comment-16631130
 ] 

Ryan Blue commented on PARQUET-1201:


[~gszadovszky], where is the branch for page skipping? Is it this one? 
https://github.com/apache/parquet-mr/tree/column-indexes

I just went to review it, but I don't see a PR. Could you open one against 
master?

> Column indexes
> --
>
> Key: PARQUET-1201
> URL: https://issues.apache.org/jira/browse/PARQUET-1201
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.10.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>
> Write the column indexes described in PARQUET-922.
>  This is the first phase of implementing the whole feature. The 
> implementation is done in the following steps:
>  * Utility to read/write indexes in parquet-format
>  * Writing indexes in the parquet file
>  * Extend parquet-tools and parquet-cli to show the indexes
>  * Limit index size based on parquet properties
>  * Trim min/max values where possible based on parquet properties
>  * Filtering based on column indexes
> The work is done on the feature branch {{column-indexes}}. This JIRA will be 
> resolved after the branch has been merged to {{master}}.





[jira] [Commented] (PARQUET-1160) [C++] Implement BYTE_ARRAY-backed Decimal reads

2018-09-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631157#comment-16631157
 ] 

ASF GitHub Bot commented on PARQUET-1160:
-

thaining opened a new pull request #1: PARQUET-1160: [C++] Implement 
BYTE_ARRAY-backed Decimal reads
URL: https://github.com/apache/parquet-testing/pull/1
 
 
   This change adds a data file with BYTE_ARRAY-backed decimals for unit 
testing.




> [C++] Implement BYTE_ARRAY-backed Decimal reads
> ---
>
> Key: PARQUET-1160
> URL: https://issues.apache.org/jira/browse/PARQUET-1160
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Affects Versions: cpp-1.3.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20180726193815980.parquet
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> These are valid in the parquet spec, but it seems like no system in use today 
> implements a writer for this type.
> What systems support writing Decimals with this underlying type?





widening primitive conversion in parquet dictionary

2018-09-27 Thread Swapnil Chougule
Hi

Is there widening primitive conversion support in the Parquet dictionary?

I can see that only same-type methods are implemented in the dictionary:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/PlainValuesDictionary.java

I came across a case where long data needs to be read as double. A
PlainLongDictionary is created for it, but this dictionary only implements
'decodeToLong'. Can we add a 'decodeToDouble' implementation here as well,
since long to double is a widening primitive conversion? The same scenario
applies to the other types that support widening primitive conversions.

Thanks,
Swapnil
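A minimal sketch of what the request amounts to (hypothetical Python model, not the parquet-mr Java API): a long-backed dictionary that, alongside its same-type decode, offers a decode performing the widening long-to-double conversion itself.

```python
class PlainLongDictionarySketch:
    """Hypothetical model of a long-backed dictionary that also supports
    a widening long -> double decode."""

    def __init__(self, values):
        self._values = list(values)

    def decode_to_long(self, dictionary_id: int) -> int:
        # same-type decode, as implemented today
        return self._values[dictionary_id]

    def decode_to_double(self, dictionary_id: int) -> float:
        # widening primitive conversion: every long is representable as a
        # double, though values beyond 2**53 may lose precision
        return float(self._values[dictionary_id])

d = PlainLongDictionarySketch([7, 42])
print(d.decode_to_long(1))    # 42
print(d.decode_to_double(0))  # 7.0
```

The precision caveat is the main design question such an addition would raise: widening long to double is lossy above 2**53, so readers would need to accept that semantics.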