[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969483#comment-16969483
 ] 

Xinli Shang commented on PARQUET-1681:
--

Hi [~rdblue], do you still remember, or have documented somewhere, the several 
problems that PARQUET-651 solved? I am evaluating whether we should roll 
PARQUET-651 back or roll forward.

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Critical
>
> When the Avro schema below is used to write a Parquet file with parquet 1.8.1 
> and the file is then read back with parquet 1.10.1 without passing any schema, 
> the read throws an exception "XXX is not a group". Reading with parquet 1.8.1 
> works fine.
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           { "name": "phone_number", "type": ["null", "string"], "default": null }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code used to read is as follows:
>   val reader = AvroParquetReader.builder[SomeRecordType](parquetPath)
>     .withConf(new Configuration)
>     .build()
>   reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() for the compatibility check. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
> schema (converted from the file schema) incompatible: the record name in the 
> Avro schema is 'phones_items', while the name in the Parquet schema is 
> 'array'. isElementType() therefore returns false, which causes the 
> "phone_number" field in the schema above to be treated as a group type, 
> which it is not. The exception is then thrown by .asGroupType().
> I didn't verify whether writing via parquet 1.10.1 reproduces the same 
> problem, but it could, because the translation from the Avro schema to the 
> Parquet schema has not changed.
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
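To make the name mismatch described above concrete, here is a minimal sketch in plain Python. It is an illustration only, not Avro's actual implementation: the dict-based schemas and the `records_compatible` helper are hypothetical stand-ins for the record-name rule that checkReaderWriterCompatibility() applies.

```python
def records_compatible(reader, writer):
    # Illustrative sketch (NOT Avro's real code) of the record-name rule:
    # two record schemas only match when the reader's full name equals the
    # writer's full name, or the writer's name appears among the reader's
    # aliases. Field-level checks are omitted; schemas are plain dicts.
    if reader["type"] != "record" or writer["type"] != "record":
        return False
    aliases = reader.get("aliases", [])
    return writer["name"] == reader["name"] or writer["name"] in aliases

# The element record converted from the file's Avro schema is named
# 'phones_items', while the record converted from the Parquet list
# wrapper is named 'array' -- so the check reports incompatible even
# though the fields would line up.
avro_side = {"type": "record", "name": "phones_items", "fields": []}
parquet_side = {"type": "record", "name": "array", "fields": []}
print(records_compatible(avro_side, parquet_side))  # False
```

Avro's real check also walks fields, types, and defaults; the point here is only that a full-name mismatch alone is enough to make the pair incompatible, which is what flips isElementType() to the wrong answer.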


[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969493#comment-16969493
 ] 

Ryan Blue commented on PARQUET-1681:


Looks like it might be https://issues.apache.org/jira/browse/AVRO-2400.



[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969491#comment-16969491
 ] 

Ryan Blue commented on PARQUET-1681:


I think we should be able to work around this instead of reverting PARQUET-651. 
If the compatibility check requires that the name matches, then we should be 
able to ensure that the name matches when converting the Parquet schema to Avro.



[jira] [Assigned] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang reassigned PARQUET-1681:


Assignee: Xinli Shang



[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969489#comment-16969489
 ] 

Ryan Blue commented on PARQUET-1681:


The Avro check should ignore record names if the record is the root. Has this 
check changed in Avro recently?



[jira] [Comment Edited] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969483#comment-16969483
 ] 

Xinli Shang edited comment on PARQUET-1681 at 11/7/19 6:28 PM:
---

Hi [~rdblue], do you still remember, or have documented somewhere, the several 
problems that PARQUET-651 solved? Is SPARK-16344 the only problem solved? I am 
evaluating whether we should roll PARQUET-651 back or roll forward.


was (Author: sha...@uber.com):
Hi [~rdblue], do you still remember or document it somewhere what are the 
several problems that  PARQUET-651 solved? I am evaluating should we roll back 
PARQUET-651  or roll forward. 



Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-11-07 Thread Gabor Szadovszky
+1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.

On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney  wrote:

> +1 from me on adding the FP encoding
>
> On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin  wrote:
> >
> > Hello all,
> >
> >
> > thanks for the vote Ryan and to Wes for the feedback.
> >
> >
> > The concern with regards to adding more complex features in the Parquet
> spec is valid.
> >
> > However, the proposed encoding is very simple and I already have
> unpolished patches for both parquet-mr and arrow.
> >
> > In its design I purposely opted for something simple to guarantee 1)
> good compression speed and 2) ease of implementation.
> >
> >
> > > On the topic of testing, I added four more test cases, which were taken
> > > from here. I also added the size in MB of all test cases and the entropy
> > > per element.
> >
> > Having the entropy reported helps show that the encoding performs better
> than any other option for high-entropy data and not so well for low-entropy
> data.
> >
> >
> > I would be happy to receive some more feedback and votes.
> >
> >
> > Kind regards,
> >
> > Martin
> >
> > 
> > From: Ryan Blue 
> > Sent: Friday, November 1, 2019 6:28 PM
> > To: Parquet Dev
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> >
> > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> Looks
> > like it is simple and performs well. We could use a good floating point
> > encoding.
> >
> > I don't think I agree that differences in features between the Java and
> > C++ implementations should hold back new work. It would be great to have
> > more testing and validation, as well as more thorough implementations. But
> > I don't think we should reject contributions like this because of those
> > concerns.
> >
> > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney  wrote:
> >
> > > I have to say I'm struggling with piling more things into the Parquet
> > > specification when we already have a significant implementation
> > > shortfall in other areas. LZ4 is still not properly implemented for
> > > example, and then there is the question of the V2 encodings and data
> > > page formats.
> > >
> > > I'm generally in favor of adding more efficient storage of floating
> > > point data, but will it actually be implemented broadly? Parquet as a
> > > format already has become an "implementation swamp" where any two
> > > implementations may not be compatible with each other, particularly in
> > > consideration of more advanced features.
> > >
> > > For a single organization using a single implementation, having
> > > advanced features may be useful, so I see the benefits to users that
> > > tightly control what code and what settings to use.
> > >
> > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin 
> wrote:
> > > >
> > > > Dear all,
> > > >
> > > >
> > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> encoding?
> > > >
> > > > Please feel free to contact me directly if you need help or would
> like
> > > to provide more test data.
> > > >
> > > >
> > > > Results for the encoding based on the implementation in Arrow are
> here:
> > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > Patch to Arrow is here:
> > >
> https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > >
> > > >
> > > > The new encoding combined with a compressor performs better than any
> of
> > > the other alternatives for data where there is little change in the
> > > upper-most bytes of fp32 and fp64 values. My early experiments also
> show
> > > that this encoding+zstd performs better on average than any of the
> > > specialized floating-point lossless compressors like fpc, spdp, zfp.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > 
> > > > From: Radev, Martin 
> > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > >
> > > > Dear Ryan Blue and other Parquet developers,
> > > >
> > > > I tested Ryan's proposal for modifying the encoding.
> > > >
> > > > The short answer is that it doesn't perform well in my tests. The
> > > encoding, code and results can be viewed below.
> > > >
> > > >
> > > > The current implementation only handles 32-bit IEEE754 floats in the
> > > following way:
> > > >
> > > >   1.  For each block of 128 values, the min and max is computed for
> the
> > > exponent
> > > >   2.  The number of bits for the exponent RLE is computed as
> > > ceil(log2((max - min + 1))). The sign bit uses 1 bit.
> > > >   3.  The sign, exponent and 23 remaining mantissa bits are
> extracted.
> > > >   4.  One RLE encoder is used for the sign and one for the exponent.
> > > > A new RLE encoder for the exponent is created if the block requires

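For readers following the thread, the BYTE_STREAM_SPLIT transform under discussion can be sketched in a few lines. This is an illustration of the idea only, not the parquet-mr or Arrow implementation: byte i of every float32 value is gathered into stream i, and the streams are concatenated so that a general-purpose compressor such as zstd sees long runs of similar bytes.

```python
import struct

def byte_stream_split_encode(values):
    # Pack float32 values little-endian, then gather the i-th byte of
    # every value into stream i and concatenate the four streams.
    raw = struct.pack(f"<{len(values)}f", *values)
    width = 4  # bytes per float32
    return b"".join(raw[i::width] for i in range(width))

def byte_stream_split_decode(data, width=4):
    # Split the buffer back into `width` streams and re-interleave
    # one byte from each stream per value.
    n = len(data) // width
    streams = [data[i * n:(i + 1) * n] for i in range(width)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))
```

The transform itself neither compresses nor loses data; any gain comes entirely from how compressible the rearranged bytes are, which is why the entropy of the upper bytes matters in the benchmarks discussed above.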
[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Fokko Driesprong (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969911#comment-16969911
 ] 

Fokko Driesprong commented on PARQUET-1681:
---

[~rdblue] AVRO-2400 was a huge regression bug indeed. But that is in Avro 
1.9.x, which hasn't been released with Parquet: Parquet 1.10.1 still runs on 
Avro 1.8.2, and Parquet 1.11.0 is the first version that targets Avro 1.9.1.

This ticket states that 1.9.1 is also affected, but that bug has been resolved 
there. [~sha...@uber.com] Did you check with 1.9.1 as well? Would it be 
possible to derive a unit test, so we can reproduce the bug in CI?



[jira] [Resolved] (PARQUET-1645) Bump Apache Avro to 1.9.1

2019-11-07 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved PARQUET-1645.
---
Resolution: Duplicate

> Bump Apache Avro to 1.9.1
> -
>
> Key: PARQUET-1645
> URL: https://issues.apache.org/jira/browse/PARQUET-1645
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.11.0
>
>






[jira] [Commented] (PARQUET-1645) Bump Apache Avro to 1.9.1

2019-11-07 Thread Fokko Driesprong (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969904#comment-16969904
 ] 

Fokko Driesprong commented on PARQUET-1645:
---

Thanks for tagging me in here. Allow me to elaborate.

If you update both Parquet and Avro, it will be fine. The problem is that 
Parquet used Avro's Jackson public API, and this public API was removed in 
the 1.9 branch. You can see it in the PR here: 
https://github.com/apache/parquet-mr/commit/9d6fb45e54da65cbd407bb3e7bff0981aa9f8f9f#diff-600376dffeb79835ede4a0b285078036

In Spark, I'd love to update Avro to the latest version, since it contains a lot 
of updates and patches a lot of CVEs. This weekend I'll compile Spark with 
Parquet 1.11-SNAPSHOT and Avro 1.9 to see what we run into.



[jira] [Commented] (PARQUET-1645) Bump Apache Avro to 1.9.1

2019-11-07 Thread Michael Heuer (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969351#comment-16969351
 ] 

Michael Heuer commented on PARQUET-1645:


I am very curious about this – Parquet vs. Avro version incompatibilities have 
been a major source of headaches for us downstream of Apache Spark. Will Spark 
be able to accept the Avro 1.9.1 and Parquet 1.11.0 upgrades simultaneously?



[jira] [Created] (PARQUET-1688) [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7

2019-11-07 Thread Kouhei Sutou (Jira)
Kouhei Sutou created PARQUET-1688:
-

 Summary: [C++] StreamWriter/StreamReader can't be built with g++ 
4.8.5 on CentOS 7
 Key: PARQUET-1688
 URL: https://issues.apache.org/jira/browse/PARQUET-1688
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Kouhei Sutou


cc [~gawain_bolton]

This has been broken since PARQUET-1678 was merged.

It seems that with g++ 4.8.5 on CentOS 7 the implicitly-declared move 
assignment operator is not {{noexcept}}, so defaulting 
{{operator=() noexcept}} fails to compile:

https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=2562=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=5469=5484=1=1

{noformat}
In file included from 
/root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.h:31:0,
 from 
/root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.cc:18:
/root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_writer.h:67:17:
 error: function 'parquet::StreamWriter& 
parquet::StreamWriter::operator=(parquet::StreamWriter&&)' defaulted on its 
first declaration with an exception-specification that differs from the 
implicit declaration 'parquet::StreamWriter& 
parquet::StreamWriter::operator=(parquet::StreamWriter&&)'
   StreamWriter& operator=(StreamWriter&&) noexcept = default;
 ^
In file included from 
/root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.cc:18:0:
/root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.h:61:17:
 error: function 'parquet::StreamReader& 
parquet::StreamReader::operator=(parquet::StreamReader&&)' defaulted on its 
first declaration with an exception-specification that differs from the 
implicit declaration 'parquet::StreamReader& 
parquet::StreamReader::operator=(parquet::StreamReader&&)'
   StreamReader& operator=(StreamReader&&) noexcept = default;
 ^
make[2]: *** [src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o] 
Error 1
make[2]: *** Waiting for unfinished jobs
In file included from 
/root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_writer.cc:18:0:
/root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_writer.h:67:17:
 error: function 'parquet::StreamWriter& 
parquet::StreamWriter::operator=(parquet::StreamWriter&&)' defaulted on its 
first declaration with an exception-specification that differs from the 
implicit declaration 'parquet::StreamWriter& 
parquet::StreamWriter::operator=(parquet::StreamWriter&&)'
   StreamWriter& operator=(StreamWriter&&) noexcept = default;
 ^
{noformat}





[jira] [Commented] (PARQUET-1688) [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969731#comment-16969731
 ] 

Wes McKinney commented on PARQUET-1688:
---

I think this duplicates ARROW-7088



Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

2019-11-07 Thread Junjie Chen
+1 from me to add BYTE_STREAM_SPLIT to parquet-format.

Gabor Szadovszky  于2019年11月7日周四 下午6:07写道:
>
> +1 for adding BYTE_STREAM_SPLIT encoding to parquet-format.
>
> On Tue, Nov 5, 2019 at 11:22 PM Wes McKinney  wrote:
>
> > +1 from me on adding the FP encoding
> >
> > On Sat, Nov 2, 2019 at 4:51 AM Radev, Martin  wrote:
> > >
> > > Hello all,
> > >
> > >
> > > thanks for the vote Ryan and to Wes for the feedback.
> > >
> > >
> > > The concern with regards to adding more complex features in the Parquet
> > spec is valid.
> > >
> > > However, the proposed encoding is very simple and I already have
> > unpolished patches for both parquet-mr and arrow.
> > >
> > > In its design I purposely opted for something simple to guarantee 1)
> > good compression speed and 2) ease of implementation.
> > >
> > >
> > > On the topic of testing, I added four more test cases which were taken
> > from here. I also added the size in MB of
> > all test cases and the entropy per element.
> > >
> > > Having the entropy reported helps show that the encoding performs better
> > than any other option for high-entropy data and not so well for low-entropy
> > data.
> > >
> > >
> > > I would be happy to receive some more feedback and votes.
> > >
> > >
> > > Kind regards,
> > >
> > > Martin
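[Editor's note] For readers who have not seen the proposal: BYTE_STREAM_SPLIT scatters the bytes of each fixed-width value into separate streams (all first bytes, then all second bytes, and so on), so a downstream compressor sees long runs of similar sign/exponent bytes. Below is a minimal pure-Python sketch of the idea for float32 values — an illustration only, not the parquet-mr or Arrow implementation:

```python
import struct

def byte_stream_split(values):
    # Pack the float32 values, then scatter byte b of every value
    # into stream b; the streams are concatenated back to back.
    raw = struct.pack("<%df" % len(values), *values)
    width = 4
    return bytes(raw[i * width + b]
                 for b in range(width)
                 for i in range(len(values)))

def byte_stream_unsplit(encoded, count):
    # Inverse transform: byte b of value i lives at offset b * count + i.
    width = 4
    raw = bytes(encoded[b * count + i]
                for i in range(count)
                for b in range(width))
    return list(struct.unpack("<%df" % count, raw))

data = [1.5, -2.25, 1000.125, 0.0]
encoded = byte_stream_split(data)
assert byte_stream_unsplit(encoded, len(data)) == data
```

The transform itself adds no compression; the gain comes from pairing it with a general-purpose compressor such as zstd, as in the benchmarks linked later in the thread.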
> > >
> > > 
> > > From: Ryan Blue 
> > > Sent: Friday, November 1, 2019 6:28 PM
> > > To: Parquet Dev
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > I'm +1 for adding the definition of BYTE_STREAM_SPLIT to the format.
> > Looks
> > > like it is simple and performs well. We could use a good floating point
> > > encoding.
> > >
> > > I don't think I agree that differences in features between the Java and
> > CPP
> > > implementations should hold back new work. It would be great to have more
> > > testing and validation, as well as more thorough implementations. But I
> > > don't think we should reject contributions like this because of those
> > > concerns.
> > >
> > > On Fri, Nov 1, 2019 at 9:27 AM Wes McKinney  wrote:
> > >
> > > > I have to say I'm struggling with piling more things into the Parquet
> > > > specification when we already have a significant implementation
> > > > shortfall in other areas. LZ4 is still not properly implemented for
> > > > example, and then there is the question of the V2 encodings and data
> > > > page formats.
> > > >
> > > > I'm generally in favor of adding more efficient storage of floating
> > > > point data, but will it actually be implemented broadly? Parquet as a
> > > > format already has become an "implementation swamp" where any two
> > > > implementations may not be compatible with each other, particularly in
> > > > consideration of more advanced features.
> > > >
> > > > For a single organization using a single implementation, having
> > > > advanced features may be useful, so I see the benefits to users that
> > > > tightly control what code and what settings to use.
> > > >
> > > > On Thu, Oct 31, 2019 at 3:51 AM Radev, Martin 
> > wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > >
> > > > > would there be any interest in reviewing the BYTE_STREAM_SPLIT
> > encoding?
> > > > >
> > > > > Please feel free to contact me directly if you need help or would
> > like
> > > > to provide more test data.
> > > > >
> > > > >
> > > > > Results for the encoding based on the implementation in Arrow are
> > here:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > > Patch to Arrow is here:
> > > >
> > https://github.com/martinradev/arrow/commit/10de1e0f8a513b742edddeb6ba0d553617b1aa49
> > > > >
> > > > >
> > > > > The new encoding combined with a compressor performs better than any
> > of
> > > > the other alternatives for data where there is little change in the
> > > > upper-most bytes of fp32 and fp64 values. My early experiments also
> > show
> > > > that this encoding+zstd performs better on average than any of the
> > > > specialized floating-point lossless compressors like fpc, spdp, zfp.
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Martin
> > > > >
> > > > > 
> > > > > From: Radev, Martin 
> > > > > Sent: Thursday, October 10, 2019 2:34:15 PM
> > > > > To: Parquet Dev
> > > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > > > >
> > > > > Dear Ryan Blue and other Parquet developers,
> > > > >
> > > > > I tested Ryan's proposal for modifying the encoding.
> > > > >
> > > > > The short answer is that it doesn't perform well in my tests. The
> > > > encoding, code and results can be viewed below.
> > > > >
> > > > >
> > > > > The current implementation only handles 32-bit IEEE754 floats in the
> > > > following way:
> > > > >
> > > > >   1.  For each block of 128 values, the min and max is computed for
> > the
> > > > exponent

[jira] [Commented] (PARQUET-1688) [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969737#comment-16969737
 ] 

Wes McKinney commented on PARQUET-1688:
---

in PARQUET. I'll close ARROW-7088

> [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7
> -
>
> Key: PARQUET-1688
> URL: https://issues.apache.org/jira/browse/PARQUET-1688
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Kouhei Sutou
>Priority: Major
>
> cc [~gawain_bolton]
> This is caused since PARQUET-1678 is merged.
> It seems that g++ 4.8.5 on CentOS 7 doesn't have the default implementation 
> of {{operator=() noexcept}}:
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=2562=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=5b4cc83a-7bb0-5664-5bb1-588f7e4dc05b=5469=5484=1=1
> {noformat}
> In file included from 
> /root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.h:31:0,
>  from 
> /root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.cc:18:
> /root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_writer.h:67:17:
>  error: function 'parquet::StreamWriter& 
> parquet::StreamWriter::operator=(parquet::StreamWriter&&)' defaulted on its 
> first declaration with an exception-specification that differs from the 
> implicit declaration 'parquet::StreamWriter& 
> parquet::StreamWriter::operator=(parquet::StreamWriter&&)'
>StreamWriter& operator=(StreamWriter&&) noexcept = default;
>  ^
> In file included from 
> /root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.cc:18:0:
> /root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_reader.h:61:17:
>  error: function 'parquet::StreamReader& 
> parquet::StreamReader::operator=(parquet::StreamReader&&)' defaulted on its 
> first declaration with an exception-specification that differs from the 
> implicit declaration 'parquet::StreamReader& 
> parquet::StreamReader::operator=(parquet::StreamReader&&)'
>StreamReader& operator=(StreamReader&&) noexcept = default;
>  ^
> make[2]: *** [src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o] 
> Error 1
> make[2]: *** Waiting for unfinished jobs
> In file included from 
> /root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_writer.cc:18:0:
> /root/rpmbuild/BUILD/apache-arrow-0.15.0.dev227/cpp/src/parquet/stream_writer.h:67:17:
>  error: function 'parquet::StreamWriter& 
> parquet::StreamWriter::operator=(parquet::StreamWriter&&)' defaulted on its 
> first declaration with an exception-specification that differs from the 
> implicit declaration 'parquet::StreamWriter& 
> parquet::StreamWriter::operator=(parquet::StreamWriter&&)'
>StreamWriter& operator=(StreamWriter&&) noexcept = default;
>  ^
> {noformat}





[jira] [Updated] (PARQUET-1396) Cryptodata Interface for Schema Activation of Parquet Encryption

2019-11-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1396:
--
Fix Version/s: (was: 1.11.0)

> Cryptodata Interface for Schema Activation of Parquet Encryption
> 
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Priority: Major
>
> This JIRA is an extension to the Parquet modular encryption JIRA (PARQUET-1178), 
> which provides the basic building blocks and APIs for the encryption 
> support. 
> This JIRA provides a crypto data interface for schema activation of Parquet 
> encryption and serves as a high-level layer on top of PARQUET-1178 to make 
> its adoption easier, with a pluggable key access module and 
> without a need to use the low-level encryption APIs. This feature will also 
> enable seamless integration with existing clients.
> There is no change to the specification (parquet-format), no new Parquet APIs, and no 
> changes to existing Parquet APIs. All current applications, tests, etc. will 
> keep working.
> From the developer's perspective, they can simply implement the interface in a 
> plugin which can be attached to any Parquet application like Hive/Spark etc. 
> This decouples the complexity of dealing with KMS and schemas from Parquet 
> applications. A large organization may have hundreds or even thousands 
> of Parquet applications and pipelines; the decoupling makes Parquet 
> encryption easier to adopt.  
> From the end user's (for example, the data owner's) perspective, if they consider a 
> column sensitive, they can simply mark that column's schema as sensitive and the 
> Parquet application will encrypt that column automatically. This makes it easy 
> for end users to manage the encryption of their columns.  
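[Editor's note] The pluggable design described above amounts to a small interface that maps schema sensitivity metadata to key material. The sketch below is hypothetical Python (the real interface is Java and defined in the PARQUET-1396 patch); the names CryptoMetadataRetriever, get_column_keys, and the "sensitive" schema property are illustrative assumptions, not the actual API:

```python
from abc import ABC, abstractmethod

class CryptoMetadataRetriever(ABC):
    """Hypothetical pluggable key-access module: given schema metadata,
    decide which columns are sensitive and return key material for each."""

    @abstractmethod
    def get_column_keys(self, schema_fields):
        """Return a dict mapping column name -> key bytes for sensitive columns."""

class TagBasedRetriever(CryptoMetadataRetriever):
    def __init__(self, keystore):
        # In practice this would wrap a KMS client, not a plain dict.
        self.keystore = keystore

    def get_column_keys(self, schema_fields):
        # Columns whose schema metadata marks them sensitive get a key;
        # everything else is written in plaintext.
        return {name: self.keystore[name]
                for name, props in schema_fields.items()
                if props.get("sensitive")}

fields = {"ssn": {"sensitive": True}, "zip": {}}
keys = TagBasedRetriever({"ssn": b"k1"}).get_column_keys(fields)
assert keys == {"ssn": b"k1"}
```

The point of the indirection is that the Parquet application only sees the interface; the KMS and the schema-tagging policy live entirely inside the plugin.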





[jira] [Commented] (PARQUET-1396) Cryptodata Interface for Schema Activation of Parquet Encryption

2019-11-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969043#comment-16969043
 ] 

Gabor Szadovszky commented on PARQUET-1396:
---

As this is related to encryption, which is not targeted for 1.11.0, I'm removing 
the target from here.

> Cryptodata Interface for Schema Activation of Parquet Encryption
> 
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.11.0
>
>





[jira] [Commented] (PARQUET-1397) Sample of usage Parquet-1396 and Parquet-1178 for column level encryption with pluggable key access

2019-11-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969044#comment-16969044
 ] 

Gabor Szadovszky commented on PARQUET-1397:
---

As this is related to encryption, which is not targeted for 1.11.0, I'm removing 
the target from here.

> Sample of usage Parquet-1396 and Parquet-1178 for column level encryption 
> with pluggable key access
> ---
>
> Key: PARQUET-1397
> URL: https://issues.apache.org/jira/browse/PARQUET-1397
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Xinli Shang
>Priority: Major
>
> This JIRA provides a sample of using PARQUET-1396 and PARQUET-1178 column-level 
> encryption with pluggable key access. The Spark SQL application shows how to 
> configure PARQUET-1396 to encrypt columns. The project 
> CryptoMetadataRetriever shows how to implement the interface defined in 
> PARQUET-1396, as an example. 
> The project will be uploaded soon. 
>  





[jira] [Updated] (PARQUET-1397) Sample of usage Parquet-1396 and Parquet-1178 for column level encryption with pluggable key access

2019-11-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1397:
--
Fix Version/s: (was: 1.11.0)

> Sample of usage Parquet-1396 and Parquet-1178 for column level encryption 
> with pluggable key access
> ---
>
> Key: PARQUET-1397
> URL: https://issues.apache.org/jira/browse/PARQUET-1397
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Xinli Shang
>Priority: Major
>





[jira] [Commented] (PARQUET-1645) Bump Apache Avro to 1.9.1

2019-11-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969045#comment-16969045
 ] 

Gabor Szadovszky commented on PARQUET-1645:
---

[~fokko], is it really targeted to 1.11.0?

> Bump Apache Avro to 1.9.1
> -
>
> Key: PARQUET-1645
> URL: https://issues.apache.org/jira/browse/PARQUET-1645
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.11.0
>
>






[jira] [Resolved] (PARQUET-1667) Close InputStream after usage

2019-11-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1667.
---
Resolution: Won't Fix

Based on the discussion in the PR I'm closing this.

> Close InputStream after usage
> -
>
> Key: PARQUET-1667
> URL: https://issues.apache.org/jira/browse/PARQUET-1667
> Project: Parquet
>  Issue Type: Task
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Make sure that the streams are closed using try-with-resources





[jira] [Commented] (PARQUET-1676) Remove hive modules

2019-11-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969049#comment-16969049
 ] 

Gabor Szadovszky commented on PARQUET-1676:
---

The parent PARQUET-1666 is targeted to 1.12.0. Is it really targeted to 1.11.0?

> Remove hive modules
> ---
>
> Key: PARQUET-1676
> URL: https://issues.apache.org/jira/browse/PARQUET-1676
> Project: Parquet
>  Issue Type: Task
>Affects Versions: 1.10.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Remove the hive modules as discussed in the Parquet sync.





[jira] [Commented] (PARQUET-1408) parquet-tools SimpleRecord does not display empty fields

2019-11-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969038#comment-16969038
 ] 

Gabor Szadovszky commented on PARQUET-1408:
---

As this issue is not a regression since 1.10.0 and is minor I am removing the 
target 1.11.0.

> parquet-tools SimpleRecord does not display empty fields
> 
>
> Key: PARQUET-1408
> URL: https://issues.apache.org/jira/browse/PARQUET-1408
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Nicholas Rushton
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> When using parquet-tools on a parquet file with null values, the null columns 
> are omitted from the output.
>  
> Example:
> {code:java}
> scala> case class Foo(a: Int, b: String)
> defined class Foo
> scala> org.apache.spark.sql.SparkSession.builder.getOrCreate.createDataset((0 
> to 1000).map(x => Foo(1,null))).write.parquet("/tmp/foobar/"){code}
> Actual:
> {code:java}
> ☁  parquet-tools [master] ⚡  java -jar 
> target/parquet-tools-1.10.1-SNAPSHOT.jar cat -j 
> /tmp/foobar/part-0-436a4d37-d82a-4771-8e7e-e4d428464675-c000.snappy.parquet
>  | head -n5
> {"a":1}
> {"a":1}
> {"a":1}
> {"a":1}
> {"a":1}{code}
> Expected:
> {code:java}
> ☁  parquet-tools [master] ⚡  java -jar 
> target/parquet-tools-1.10.1-SNAPSHOT.jar cat -j 
> /tmp/foobar/part-0-436a4d37-d82a-4771-8e7e-e4d428464675-c000.snappy.parquet
>  | head -n5
> {"a":1,"b":null}
> {"a":1,"b":null}
> {"a":1,"b":null}
> {"a":1,"b":null}
> {"a":1,"b":null}{code}
>  
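[Editor's note] The difference between the two outputs above comes down to whether the JSON writer skips null-valued fields. A minimal Python illustration of the two behaviors (parquet-tools itself is Java; this only sketches the serialization difference):

```python
import json

record = {"a": 1, "b": None}

# Behavior reported in the ticket: null-valued columns are dropped entirely.
actual = json.dumps({k: v for k, v in record.items() if v is not None})

# Expected behavior: emit every field, with explicit JSON nulls.
expected = json.dumps(record)

assert actual == '{"a": 1}'
assert expected == '{"a": 1, "b": null}'
```

Emitting explicit nulls matters for consumers that infer a schema from the JSON output: with nulls dropped, column "b" simply never appears.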





[jira] [Updated] (PARQUET-1408) parquet-tools SimpleRecord does not display empty fields

2019-11-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1408:
--
Fix Version/s: (was: 1.11.0)

> parquet-tools SimpleRecord does not display empty fields
> 
>
> Key: PARQUET-1408
> URL: https://issues.apache.org/jira/browse/PARQUET-1408
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Nicholas Rushton
>Priority: Minor
>  Labels: pull-request-available
>





[GitHub] [parquet-site] gszadovszky merged pull request #1: PARQUET-1674: The announcement email on the web site does not comply with ASF rules

2019-11-07 Thread GitBox
gszadovszky merged pull request #1: PARQUET-1674: The announcement email on the 
web site does not comply with ASF rules
URL: https://github.com/apache/parquet-site/pull/1
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1681:
--
Fix Version/s: (was: 1.11.0)

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Priority: Critical
>
> When using the Avro schema below to write a parquet (1.8.1) file and then 
> reading it back with parquet 1.10.1 without passing any schema, the read throws 
> an exception "XXX is not a group". Reading with parquet 1.8.1 works fine. 
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           {
>             "name": "phone_number",
>             "type": [ "null", "string" ],
>             "default": null
>           }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code to read is as below: 
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
> reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() for the compatibility check. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the 
> Avro schema (converted from the file schema) incompatible: the record name in 
> the Avro schema is 'phones_items', but it is 'array' in the Parquet schema. 
> It therefore returns false, which causes the "phone_number" field in 
> the above schema to be treated as a group type, which it is not. The 
> exception is then thrown by .asGroupType(). 
> I didn't verify whether writing via parquet 1.10.1 reproduces the same problem, 
> but it could, because the translation of the Avro schema to the Parquet schema 
> has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 
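[Editor's note] The failure mode can be illustrated with a toy version of the record-name rule inside Avro's reader/writer compatibility check. This is illustrative Python, not the actual Avro library code; records_match is a made-up helper that only models the name/alias comparison:

```python
def records_match(reader, writer):
    # Toy model of the name check in Avro's checkReaderWriterCompatibility():
    # two record schemas are only compatible when their names line up,
    # or the writer's name appears among the reader's aliases.
    return (reader["type"] == writer["type"] == "record"
            and (reader["name"] == writer["name"]
                 or writer["name"] in reader.get("aliases", [])))

avro_element = {"type": "record", "name": "phones_items"}
# The element record reconverted from the Parquet schema keeps the
# group name "array", so the name comparison fails:
parquet_element = {"type": "record", "name": "array"}

assert not records_match(parquet_element, avro_element)
# Declaring the other name as an alias would make the pair compatible:
assert records_match({**parquet_element, "aliases": ["phones_items"]},
                     avro_element)
```

This is why the check returns false here even though the field structures are identical: the comparison is name-sensitive, and the Parquet-to-Avro conversion does not preserve the original record name.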





[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969042#comment-16969042
 ] 

Gabor Szadovszky commented on PARQUET-1681:
---

I am removing the target 1.11.0 as this is an improvement and not a regression 
since 1.10.0.

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Priority: Critical
> Fix For: 1.11.0
>
>





Merged C++ Parquet Encryption implementation PARQUET-1300

2019-11-07 Thread Wes McKinney
hi folks,

I recently merged https://github.com/apache/arrow/pull/4826 containing
the bulk of the Parquet C++ encrypted file implementation:

https://github.com/apache/arrow/commit/41753ace481a82dea651c54639ec4adbae169187

This patch has been in progress for over a year with numerous rounds
of review, so I wanted to thank everyone for their hard work on this
project.

I'm copying dev@arrow because I would guess this patch has various
implications on packaging and build scripts and some JIRA issues may
need to be opened.

Note: I'm concerned about cross-implementation compatibility, so
developing some automated compatibility tests that exercise the
different modes of encryption (encrypted metadata, plaintext metadata,
and so forth) seems like a good idea to me.

Thanks,
Wes


[jira] [Comment Edited] (PARQUET-1688) [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7

2019-11-07 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969734#comment-16969734
 ] 

Kouhei Sutou edited comment on PARQUET-1688 at 11/8/19 1:56 AM:


Oh, sorry. I missed it.

Which project should we use to track this? ARROW- or PARQUET-?


was (Author: kou):
Oh, sorry. I missed it.

Should we track this in ARROW- or PARQUET-?

> [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7
> -
>
> Key: PARQUET-1688
> URL: https://issues.apache.org/jira/browse/PARQUET-1688
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Kouhei Sutou
>Priority: Major
>





[jira] [Commented] (PARQUET-1688) [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7

2019-11-07 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969734#comment-16969734
 ] 

Kouhei Sutou commented on PARQUET-1688:
---

Oh, sorry. I missed it.

Should we track this in ARROW- or PARQUET-?

> [C++] StreamWriter/StreamReader can't be built with g++ 4.8.5 on CentOS 7
> -
>
> Key: PARQUET-1688
> URL: https://issues.apache.org/jira/browse/PARQUET-1688
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Kouhei Sutou
>Priority: Major
>


