Re: [VOTE] Release Apache Parquet 1.11.1 RC1
+1 (binding). Verified sigs/sums/build/test.

I did have some problems with the tests. It appears that you cannot run specific project tests in isolation in some cases (e.g. mvn test -pl parquet-tools fails, but works in conjunction with other tests). Nothing to hold up the release for, but it appears some of the test classpaths aren't set up correctly.

-Dan

On 2020/07/29 08:23:49, Gabor Szadovszky wrote:
> Hi everyone,
>
> I propose the following RC to be released as the official Apache Parquet
> 1.11.1 release.
>
> The commit id is 765bd5cd7fdef2af1cecd0755000694b992bfadd
> * This corresponds to the tag: apache-parquet-1.11.1-rc1
> * https://github.com/apache/parquet-mr/tree/765bd5cd7fdef2af1cecd0755000694b992bfadd
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.1-rc1
>
> You can find the KEYS file here:
> * https://downloads.apache.org/parquet/KEYS
>
> Binary artifacts are staged in Nexus here:
> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>
> This release includes changes listed at
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1-rc1/CHANGES.md
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Parquet 1.11.1
> [ ] +0
> [ ] -1 Do not release this because...
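[Editor's note] If the module-level test failure is a reactor/classpath issue, one workaround worth trying (an untested suggestion, not from the thread) is Maven's -am (--also-make) flag, which builds the selected module's upstream reactor dependencies first:

```shell
# Run only the parquet-tools tests, but let Maven build the modules it
# depends on first (-am / --also-make), so the test classpath resolves.
mvn test -pl parquet-tools -am

# Alternatively, install all modules once, then run the single module
# against the freshly installed artifacts.
mvn install -DskipTests
mvn test -pl parquet-tools
```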
[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…
shangxinli commented on pull request #808: URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-672148979

@gszadovszky Thanks for the correction regarding PARQUET-1784! Regarding serialization/deserialization, it is not done; I was aware of that when I used ExtType, but it is something we will need to add later. Actually, it is needed. The use case is that we need to translate the schema inside Parquet files created by upstream jobs (e.g. raw data) to downstream Hive ETL metastores like HMS. The lineage of the crypto properties would otherwise be broken. This is actually a reason we should add metadata (along with serialization/deserialization) instead of using Configuration. Creating helper functions helps, but the problem remains that we need to add a long namespace all the way down to the (nested) column level. Sometimes one job needs to deal with more than one metastore, which requires adding a prefix to the namespace. So to locate a column, we need something like metastore.db.table.column_outlayerscolumn_innerlayers.crypto_key. This is not user friendly. Again, other schemas like Avro and Spark already have this; I think it would be better to align with them.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
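[Editor's note] The verbosity concern above can be sketched in code. This is a hypothetical illustration only: the key format, prefix, and helper name below are made up for the example and are not part of any parquet-mr API.

```java
// Sketch of how a flat, Configuration-style key for a nested column's
// crypto property might look once a metastore/db/table prefix is needed.
// All names here are illustrative, not parquet-mr API.
public class CryptoKeyPathExample {

    // Join the metastore prefix and the nested column path into one
    // dotted key ending in the property name.
    static String cryptoKey(String metastore, String db, String table,
                            String columnPath, String property) {
        return String.join(".", metastore, db, table, columnPath, property);
    }

    public static void main(String[] args) {
        String key = cryptoKey("hms_prod", "sales", "orders",
                               "customer.address.zip", "crypto_key");
        // Every column of every table needs a key like this written by hand
        // when the properties travel through a flat Configuration.
        System.out.println(key);
    }
}
```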
[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…
shangxinli commented on pull request #808: URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-672093542

@ggershinsky, from what we have discussed above, I think the different opinions now narrow down to how to transport the settings from the extended ParquetWriteSupport to the CryptoPropertiesFactory implementation. We can pass them either through Configuration or through the schema. I don't see a difference between these two approaches with regard to the stated goal. As for which way is better, I think it depends. To some extent, adding the encryption properties to the schema is easier and less error-prone because they sit right next to the schema elements. It seems Gabor got that point. We should let users choose the approach that fits their use case better. Some users could even choose RPC calls instead of the two we talked about, which would already be a third option, and there could be more; for example, they could load the properties from configuration files. Again, it should be the user's choice. Column-level properties being part of the column metadata should not be a problem, because Avro and Spark already have that. If the change to add the metadata field were risky in terms of breaking existing code, that would be a concern, but that seems not to be the case, so I don't quite see why we wouldn't want to do it. We have had several rounds of discussion here. If you still have questions or concerns, I would like to set up a meeting to go through them. Please let me know.
[jira] [Resolved] (PARQUET-1793) Support writing INT96 timestamp from avro
[ https://issues.apache.org/jira/browse/PARQUET-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1793.
Resolution: Won't Fix

> Support writing INT96 timestamp from avro
>
> Key: PARQUET-1793
> URL: https://issues.apache.org/jira/browse/PARQUET-1793
> Project: Parquet
> Issue Type: Improvement
> Reporter: Tamas Palfy
> Priority: Major
>
> Add support for writing avro LONG/timestamp-millis data in INT96 (or in the
> current INT64) format in parquet.
> Add a config flag to select the required timestamp output format (INT96 or
> INT64).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] gszadovszky commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…
gszadovszky commented on pull request #808: URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671925400

@shangxinli, the column-wise configuration you are talking about (PARQUET-1784: Column-wise configuration (#754)) is only a specified key format and the related helper implementations for the Hadoop conf. We might have used this format to specify the encryption properties, but I'm afraid it is too late to do that, and I am even unsure it would make sense to have a completely different approach for setting such properties than what the other components in the Hadoop ecosystem use.

I tend to agree with @ggershinsky. The way you want to extend the Parquet schema is a general extension for adding arbitrary metadata to any schema element, yet I cannot see any purpose for it beyond what you have described. Moreover, this way you are only extending the schema objects that are used inside parquet-mr. This metadata won't be written to the Parquet files, nor serialized/deserialized to/from the metastore as is; anything you want in this metadata has to be implemented either inside parquet-mr or in the plugins.

What is good about adding the encryption properties to the schema, as you describe, is that it is easier and less error-prone to define the properties right next to the schema elements (columns). But you can also write helper methods which write the proper key/values to the Hadoop conf or the extra metadata. These helpers can be unit tested to ensure they work correctly. This way the implementation of the ParquetWriteSupport can remain compact and type/value checked.
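[Editor's note] The helper-method idea above could look roughly like the following. This is a hypothetical sketch: a plain Map stands in for Hadoop's Configuration, and the key format and class names are invented for illustration, not taken from parquet-mr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a typed, unit-testable helper that flattens per-column
// encryption settings into Configuration-style key/value pairs.
// A Map stands in for Hadoop's Configuration; key names are illustrative.
public class ColumnCryptoConfHelper {
    private final Map<String, String> conf = new LinkedHashMap<>();
    private final String prefix; // e.g. "parquet.encryption"

    public ColumnCryptoConfHelper(String prefix) {
        this.prefix = prefix;
    }

    // One checked call per column instead of a hand-written long key,
    // keeping the ParquetWriteSupport side compact.
    public ColumnCryptoConfHelper setColumnKeyId(String columnPath, String keyId) {
        conf.put(prefix + ".column." + columnPath + ".key_id", keyId);
        return this;
    }

    public Map<String, String> build() {
        return conf;
    }
}
```

In a real integration the build() output would be copied into the job's Configuration (or the extra metadata), which is exactly the part such helpers make easy to unit test.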
[GitHub] [parquet-mr] ggershinsky commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…
ggershinsky commented on pull request #808: URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671793784

Yep, a couple of concerns related to encryption. The stated goal of the crypto factory design is to be "transparent to analytic frameworks, so they can leverage Parquet modular encryption without any code changes". I think it's a good goal and a powerful capability, worth preserving. Also, a proliferation of custom channels for passing the encryption properties might lead to confusion and hard-to-trace problems in the future. This can be avoided, since the current channels can support the existing use case requirements. Beyond encryption: I believe using a general Object as an interface parameter is considered problematic, and adding a third channel for custom property passing could lead to issues in other areas as well.