[GitHub] [parquet-mr] dbtsai commented on pull request #793: PARQUET-1866: Replace Hadoop ZSTD with JNI-ZSTD

2020-06-01 Thread GitBox


dbtsai commented on pull request #793:
URL: https://github.com/apache/parquet-mr/pull/793#issuecomment-637195534


   @shangxinli do we have a benchmark comparing this to the native Hadoop codec 
in both size and speed? Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1866) Replace Hadoop ZSTD with JNI-ZSTD

2020-06-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121423#comment-17121423
 ] 

ASF GitHub Bot commented on PARQUET-1866:
-

dbtsai commented on pull request #793:
URL: https://github.com/apache/parquet-mr/pull/793#issuecomment-637195534


   @shangxinli do we have a benchmark comparing this to the native Hadoop codec 
in both size and speed? Thanks.




> Replace Hadoop ZSTD with JNI-ZSTD
> -
>
> Key: PARQUET-1866
> URL: https://issues.apache.org/jira/browse/PARQUET-1866
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> The parquet-mr repo has been using 
> [ZSTD-JNI|https://github.com/luben/zstd-jni/tree/master/src/main/java/com/github/luben/zstd]
>  for the parquet-cli project. Using this JNI library is cleaner than using 
> Hadoop ZSTD compression, because 1) installing Hadoop on a development box 
> is cumbersome, and 2) older versions of Hadoop don't support ZSTD, and 
> upgrading Hadoop is another pain. This Jira is to replace Hadoop ZSTD with 
> ZSTD-JNI for the parquet-hadoop project. 
> According to the author of ZSTD-JNI, Flink, Spark, and Cassandra all use 
> ZSTD-JNI for ZSTD.
> Another approach is to use https://github.com/airlift/aircompressor, which is 
> a pure Java implementation, but it seems the compression level is not 
> adjustable in aircompressor. 
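
The point about an adjustable compression level, and the size/speed question raised on the pull request, can be made concrete with a small harness. ZSTD-JNI is not part of the JDK, so the sketch below uses java.util.zip.Deflater purely as a stand-in codec that also exposes a level; the class name, the method names and the sample data are hypothetical, and the numbers it prints say nothing about ZSTD or about the Hadoop codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Stand-in sketch: Deflater replaces ZSTD-JNI here only because it ships
// with the JDK and has an adjustable compression level.
final class CodecLevelSketch {

    // Compress the input at the given level (1 = fastest, 9 = best compression).
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress into a buffer of the known original length.
    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] result = new byte[originalLength];
        int offset = 0;
        try {
            while (offset < originalLength) {
                int n = inflater.inflate(result, offset, originalLength - offset);
                if (n == 0) {
                    throw new IllegalStateException("compressed data ended early");
                }
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return result;
    }

    // Hypothetical, highly repetitive sample input.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("row-").append(i % 97).append(",value-").append(i % 13).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        for (int level : new int[] {1, 6, 9}) {
            long start = System.nanoTime();
            byte[] compressed = compress(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            boolean roundTrip = Arrays.equals(data, decompress(compressed, data.length));
            System.out.printf("level=%d size=%d bytes time=%d us roundTrip=%b%n",
                    level, compressed.length, micros, roundTrip);
        }
    }
}
```

A single timed run like this is only indicative; a real comparison would need warm-up iterations, representative Parquet pages, and the actual codecs under discussion.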



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1866) Replace Hadoop ZSTD with JNI-ZSTD

2020-06-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121421#comment-17121421
 ] 

ASF GitHub Bot commented on PARQUET-1866:
-

dbtsai commented on pull request #793:
URL: https://github.com/apache/parquet-mr/pull/793#issuecomment-637193519


   +1 @shangxinli and thank you for this contribution. 
   
   This will allow users on older versions of Hadoop that don't support 
native ZSTD to use ZSTD compression in Parquet, and users won't have to 
go through the very complicated Hadoop native installation. For developers, we 
will be able to easily test this out in different local envs.  
   
   cc @rdblue 






[jira] [Commented] (PARQUET-1684) [parquet-protobuf] default protobuf field values are stored as nulls

2020-06-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121420#comment-17121420
 ] 

ASF GitHub Bot commented on PARQUET-1684:
-

bagipriyank commented on pull request #702:
URL: https://github.com/apache/parquet-mr/pull/702#issuecomment-637192988


   I submitted this pr after we started using it in production :)





> [parquet-protobuf] default protobuf field values are stored as nulls
> 
>
> Key: PARQUET-1684
> URL: https://issues.apache.org/jira/browse/PARQUET-1684
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.11.0
>Reporter: George Haddad
>Priority: Major
>  Labels: pull-request-available
>
> When the source is a protobuf3 message, and the target file is Parquet, all 
> the default values are stored in the output parquet as `null` instead of 
> the actual type's default value.
>  For example, if the field is of type `int32`, `double` or `enum` and it 
> hasn't been set, the parquet value is `null` instead of `0`. When the 
> field's type is a `string` that hasn't been set, the parquet value is 
> `null` instead of an empty string.
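
The difference between the reported and the expected behaviour can be shown with a toy model. This is a sketch only and does not use the parquet-protobuf or protobuf APIs: a message is modelled as a map of explicitly set fields, and the schema, field names and method names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model, not the parquet-protobuf API: a "message" is the map of
// fields that were explicitly assigned, and the schema gives each field's
// proto3 default value.
final class Proto3DefaultsSketch {

    // Hypothetical schema: field name -> proto3 default for its type.
    static Map<String, Object> schemaDefaults() {
        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("count", 0);     // int32 defaults to 0
        defaults.put("ratio", 0.0d);  // double defaults to 0.0
        defaults.put("name", "");     // string defaults to the empty string
        return defaults;
    }

    // Behaviour described in the report: a field that was never set becomes null.
    static Map<String, Object> writeUnsetAsNull(Map<String, Object> setFields) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (String field : schemaDefaults().keySet()) {
            row.put(field, setFields.get(field)); // null when the field is absent
        }
        return row;
    }

    // Behaviour the report expects: fall back to the type's default value.
    static Map<String, Object> writeUnsetAsDefault(Map<String, Object> setFields) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : schemaDefaults().entrySet()) {
            row.put(entry.getKey(), setFields.getOrDefault(entry.getKey(), entry.getValue()));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> message = new LinkedHashMap<>();
        message.put("name", "abc"); // only one field is set explicitly
        System.out.println(writeUnsetAsNull(message));
        System.out.println(writeUnsetAsDefault(message));
    }
}
```

In this model the first writer yields nulls for `count` and `ratio`, while the second yields `0` and `0.0`, which is the outcome the reporter asks for.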





[GitHub] [parquet-mr] bagipriyank commented on pull request #702: PARQUET-1684: dont store default protobuf values as null for proto3

2020-06-01 Thread GitBox


bagipriyank commented on pull request #702:
URL: https://github.com/apache/parquet-mr/pull/702#issuecomment-637192988


   I submitted this pr after we started using it in production :)







[jira] [Resolved] (PARQUET-1850) toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit

2020-06-01 Thread Aniket Namadeo Mokashi (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Namadeo Mokashi resolved PARQUET-1850.
-
Resolution: Fixed

> toParquetMetadata method in ParquetMetadataConverter does not set dictionary 
> page offset bit
> 
>
> Key: PARQUET-1850
> URL: https://issues.apache.org/jira/browse/PARQUET-1850
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1, 1.12.0
>Reporter: Srinivas S T
>Assignee: Srinivas S T
>Priority: Major
> Fix For: 1.12.0
>
>
> The toParquetMetadata method converts 
> org.apache.parquet.hadoop.metadata.ParquetMetadata to 
> org.apache.parquet.format.FileMetaData, but it does not set the dictionary 
> page offset bit in FileMetaData.
> When a FileMetaData object is serialized while writing the footer and then 
> deserialized, the dictionary offset is lost because the dictionary page 
> offset bit was never set. 
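
Why a missing bit loses the value on a round trip can be illustrated with a toy model of an optional field guarded by an "is set" flag. This is a sketch under assumptions, not the Thrift-generated parquet-format code and not the fix from the pull request; the class, field and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified model of an optional primitive field guarded by an "is set" bit.
final class IssetBitSketch {

    static final class ColumnMeta {
        long dictionaryPageOffset;          // raw value
        boolean dictionaryPageOffsetIsSet;  // the bit the serializer consults

        // Setter that assigns the value and flips the bit together.
        void setDictionaryPageOffset(long offset) {
            this.dictionaryPageOffset = offset;
            this.dictionaryPageOffsetIsSet = true;
        }
    }

    // The optional field is written only when its bit is set.
    static byte[] serialize(ColumnMeta meta) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(meta.dictionaryPageOffsetIsSet);
            if (meta.dictionaryPageOffsetIsSet) {
                out.writeLong(meta.dictionaryPageOffset);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static ColumnMeta deserialize(byte[] data) {
        ColumnMeta meta = new ColumnMeta();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            if (in.readBoolean()) {
                meta.setDictionaryPageOffset(in.readLong());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return meta;
    }

    public static void main(String[] args) {
        ColumnMeta withoutBit = new ColumnMeta();
        withoutBit.dictionaryPageOffset = 4096L; // value assigned, bit never set
        ColumnMeta withBit = new ColumnMeta();
        withBit.setDictionaryPageOffset(4096L);  // value and bit set together

        System.out.println(deserialize(serialize(withoutBit)).dictionaryPageOffsetIsSet);
        System.out.println(deserialize(serialize(withBit)).dictionaryPageOffset);
    }
}
```

In the model, the object whose field was assigned directly comes back with the offset unset, while the one that went through the setter keeps it.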





[jira] [Assigned] (PARQUET-1850) toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit

2020-06-01 Thread Aniket Namadeo Mokashi (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Namadeo Mokashi reassigned PARQUET-1850:
---

Assignee: Srinivas S T



[jira] [Commented] (PARQUET-1850) toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit

2020-06-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121187#comment-17121187
 ] 

ASF GitHub Bot commented on PARQUET-1850:
-

asfgit closed pull request #789:
URL: https://github.com/apache/parquet-mr/pull/789


   





[GitHub] [parquet-mr] asfgit closed pull request #789: PARQUET-1850: Fix dictionaryPageOffset flag setting in toParquetMetadata method

2020-06-01 Thread GitBox


asfgit closed pull request #789:
URL: https://github.com/apache/parquet-mr/pull/789


   







[jira] [Commented] (PARQUET-1684) [parquet-protobuf] default protobuf field values are stored as nulls

2020-06-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121131#comment-17121131
 ] 

ASF GitHub Bot commented on PARQUET-1684:
-

Fokko commented on pull request #702:
URL: https://github.com/apache/parquet-mr/pull/702#issuecomment-636951813


   The main issue here is that there are no Protobuf committers active on the 
project anymore. Has anyone already run this patch in a production 
environment?




