Re: [VOTE] Release Apache Parquet 1.11.1 RC1

2020-08-11 Thread Daniel C. Weeks
+1 (binding) verified sigs/sums/build/test

I did have some problems with the tests.  It appears that you cannot run 
specific project tests in isolation in some cases (e.g. mvn test -pl 
parquet-tools fails, but works in conjunction with other tests).  Nothing to 
hold up the release for, but it appears some of the test classpaths aren't set 
up correctly.
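For reference, a common workaround when a module's tests only pass as part of the full build (assuming the failure comes from missing reactor dependencies rather than a genuinely broken classpath) is Maven's --also-make flag:

```shell
# Build and test only parquet-tools, but let Maven also build the modules
# it depends on (-am / --also-make) so the test classpath is populated
# from the reactor:
mvn test -pl parquet-tools -am

# Alternatively, install all modules once, then run the module's tests
# in isolation against the local repository:
mvn install -DskipTests
mvn test -pl parquet-tools
```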

-Dan


On 2020/07/29 08:23:49, Gabor Szadovszky  wrote: 
> Hi everyone,
> 
> I propose the following RC to be released as the official Apache Parquet
> 1.11.1 release.
> 
> The commit id is 765bd5cd7fdef2af1cecd0755000694b992bfadd
> * This corresponds to the tag: apache-parquet-1.11.1-rc1
> *
> https://github.com/apache/parquet-mr/tree/765bd5cd7fdef2af1cecd0755000694b992bfadd
> 
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.1-rc1
> 
> You can find the KEYS file here:
> * https://downloads.apache.org/parquet/KEYS
> 
> Binary artifacts are staged in Nexus here:
> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> 
> This release includes changes listed at
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1-rc1/CHANGES.md
> .
> 
> Please download, verify, and test.
> 
> Please vote in the next 72 hours.
> 
> [ ] +1 Release this as Apache Parquet 1.11.1
> [ ] +0
> [ ] -1 Do not release this because...
> 


[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-11 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-672148979


   @gszadovszky Thanks for the correction of PARQUET-1784!  Regarding 
serialization/deserialization, it is not done yet; I was aware of that when I 
used ExtType, but it is something we will need to add later. It is actually 
needed: the use case is that we need to translate the schema inside Parquet 
files created by upstream (e.g. raw data) to downstream Hive ETL metastores 
like HMS. Otherwise the lineage of the crypto properties is broken. This is 
actually a reason we should add metadata (along with 
serialization/deserialization) instead of using Configuration. 
   
   Creating helper functions helps, but the problem remains that we need to add 
a long namespace all the way down to the (nested) column level. Sometimes one 
job needs to deal with more than one metastore, which requires adding a prefix 
to the namespace. So to locate a column, we need something like 
metastore.db.table.column_outlayerscolumn_innerlayers.crypto_key, which is 
not user friendly. Again, other schemas like Avro and Spark already have this; I 
think it would be better to align with them. 
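To illustrate the verbosity concern (the key format, prefixes, and helper below are hypothetical illustrations, not the actual PARQUET-1784 format), a conf-based approach forces every column property into one long dotted key, whereas schema-attached metadata would live next to the column itself:

```java
import java.util.Properties;

public class CryptoKeyPaths {
    // Hypothetical helper: builds a fully qualified property key of the form
    // <metastorePrefix>.<db>.<table>.<nested.column.path>.crypto_key
    static String cryptoKeyFor(String metastore, String db, String table,
                               String columnPath) {
        return String.join(".", metastore, db, table, columnPath, "crypto_key");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // One job talking to two metastores needs a prefix per metastore,
        // and nested columns stretch the key even further:
        String key = cryptoKeyFor("metastoreA", "sales", "orders",
                                  "customer.address.zip");
        conf.setProperty(key, "key-id-42");
        System.out.println(key);
        // -> metastoreA.sales.orders.customer.address.zip.crypto_key
    }
}
```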
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-11 Thread GitBox


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-672093542


   @ggershinsky, from what we have discussed above, I think the difference of 
opinion now narrows to how to transport the settings from the extended 
ParquetWriteSupport to the CryptoPropertiesFactory implementation: either 
through 'Configuration' or through the schema. I don't see a difference between 
these two approaches with regard to the stated goal. 
   
   Regarding which way is better, I think it depends. To some extent, adding 
the encryption properties to the schema is easier and less error-prone because 
the properties sit right next to the schema elements; it seems Gabor got that 
point.  We should let users choose the option that fits their use case better. 
Some users could even choose RPC calls instead of the two we have discussed, 
which would already be a third option, and there could be more: for example, 
loading from configuration files. Again, it should be the user's choice. 
Column-level properties being part of the column metadata should not be a 
problem because Avro and Spark already have that. If the change to add the 
metadata field were risky in terms of breaking existing code, that would be a 
concern, but that does not seem to be the case. So I don't quite see why we 
wouldn't want to do it. 
   
   We have had several rounds of discussion here. If you still have questions 
or concerns, I would be happy to set up a meeting to go through them. Please 
let me know. 







[jira] [Resolved] (PARQUET-1793) Support writing INT96 timestamp from avro

2020-08-11 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1793.
---
Resolution: Won't Fix

> Support writing INT96 timestamp from avro
> -
>
> Key: PARQUET-1793
> URL: https://issues.apache.org/jira/browse/PARQUET-1793
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Tamas Palfy
>Priority: Major
>
> Add support for writing avro LONG/timestamp-millis data in INT96 (or in the 
> current INT64) format in parquet.
> Add a config flag to select the required timestamp output format (INT96 or 
> INT64). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] gszadovszky commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-11 Thread GitBox


gszadovszky commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671925400


   @shangxinli, the column-wise configuration you are talking about 
(PARQUET-1784: Column-wise configuration (#754)) is only a specified key format 
and the related helper implementations for the Hadoop conf. We might have used 
this format to specify the encryption properties, but I'm afraid it is too late 
to do that, and I am even unsure whether it would make sense to have a 
completely different approach for setting such properties than what the other 
components in the Hadoop ecosystem use.
   
   I tend to agree with @ggershinsky. The way you want to extend the parquet 
schema is a general extension for adding arbitrary metadata to any schema 
element, yet I cannot see any purpose for it beyond what you have described. 
Moreover, this way you are only extending the schema objects that are used 
inside parquet-mr. This metadata won't be written to the parquet files, nor 
serialized/deserialized to/from the metastore as is. Anything you want to be in 
this metadata has to be implemented either inside parquet-mr or in the plugins.
   
   What is good about adding the encryption properties to the schema, as you 
describe, is that it is easier and less error-prone to define the properties 
right next to the schema elements (columns). But you can also write helper 
methods that write the proper key/values to the Hadoop conf or the extra 
metadata. These helpers can be unit tested to ensure they work correctly. This 
way the implementation of the ParquetWriteSupport can stay compact and 
type/value checked.
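A minimal sketch of that helper idea (all names are hypothetical, and a plain Map stands in for the Hadoop Configuration): the helper centralizes the key format so the write-support code never concatenates property names by hand, and the helper itself is the one place that needs unit tests:

```java
import java.util.HashMap;
import java.util.Map;

public class EncryptionConfHelper {
    // Hypothetical key prefix; the real property names would be defined by
    // the CryptoPropertiesFactory implementation in use.
    private static final String PREFIX = "parquet.encryption.column.";

    // Writes the encryption key id for one column into the conf, so callers
    // never build the dotted property name themselves and cannot get it wrong.
    static void setColumnKey(Map<String, String> conf,
                             String columnPath, String keyId) {
        if (columnPath == null || columnPath.isEmpty()) {
            throw new IllegalArgumentException("column path must be set");
        }
        conf.put(PREFIX + columnPath, keyId);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setColumnKey(conf, "customer.ssn", "key-id-1");
        System.out.println(conf.get("parquet.encryption.column.customer.ssn"));
        // -> key-id-1
    }
}
```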







[GitHub] [parquet-mr] ggershinsky commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-08-11 Thread GitBox


ggershinsky commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671793784


   Yep, a couple of concerns related to encryption. 
   The stated goal of the crypto factory design is to be "transparent to 
analytic frameworks, so they can leverage Parquet modular encryption without 
any code changes". I think it's a good goal and a powerful capability, worth 
preserving.
   Also, a proliferation of custom channels for passing the encryption 
properties might lead to confusion and hard-to-trace problems in the future. 
This can be avoided, since the current channels can support the existing 
use-case requirements.
   
   Beyond encryption, I believe using a general Object as an interface 
parameter is considered problematic. Besides, adding a third channel for 
custom property passing can lead to issues in other areas as well.


