[GitHub] [parquet-mr] huaxingao commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-17 Thread GitBox


huaxingao commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1159354846

   Thank you all very much! @chenjunjiedada @dongjoon-hyun @ggershinsky 
@shangxinli 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2157) Add BloomFilter fpp config

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555826#comment-17555826
 ] 

ASF GitHub Bot commented on PARQUET-2157:
-

huaxingao commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1159354846

   Thank you all very much! @chenjunjiedada @dongjoon-hyun @ggershinsky 
@shangxinli 




> Add BloomFilter fpp config
> --
>
> Key: PARQUET-2157
> URL: https://issues.apache.org/jira/browse/PARQUET-2157
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Huaxin Gao
>Priority: Major
>
> Currently parquet-mr hardcodes the bloom filter fpp (false positive 
> probability) to 0.01. We should have a config that lets the user specify 
> the fpp.
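
For illustration, a minimal sketch of how such a write-side option could be set through the Hadoop configuration. The key names below (and the 0.05 value) are assumptions for the sake of the example, not the names fixed by PR #975:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class BloomFilterFppExample {
  public static Configuration writerConf() {
    Configuration conf = new Configuration();
    // Enable bloom filters on the write path and override the previously
    // hardcoded 0.01 false positive probability. Key names are assumed.
    conf.setBoolean("parquet.bloom.filter.enabled", true);
    conf.setDouble("parquet.bloom.filter.fpp", 0.05);
    return conf;
  }
}
{code}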



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (PARQUET-2150) parquet-protobuf to compile on mac M1

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555823#comment-17555823
 ] 

ASF GitHub Bot commented on PARQUET-2150:
-

shangxinli commented on PR #970:
URL: https://github.com/apache/parquet-mr/pull/970#issuecomment-1159351267

   @sunchao I see your change is to upgrade the protobuf version. Is that 
required to solve this problem? I don't see that in this PR.




> parquet-protobuf to compile on mac M1
> -
>
> Key: PARQUET-2150
> URL: https://issues.apache.org/jira/browse/PARQUET-2150
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> The parquet-protobuf module fails to compile on Mac M1 because the maven 
> protoc plugin cannot find the native osx-aarch_64:3.16.1 binary.
> The build needs to be tweaked to pick up the x86 binaries.
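
A workaround often used for this kind of failure, sketched here as an assumption rather than necessarily the approach taken in PR #970, is to pin the protoc artifact classifier so the x86_64 binary is downloaded and run under Rosetta when no aarch_64 build exists:

{code:xml}
<!-- Hypothetical sketch: force the x86_64 protoc binary on Apple Silicon.
     The plugin coordinates and property name are illustrative. -->
<plugin>
  <groupId>org.xolstice.maven.plugins</groupId>
  <artifactId>protobuf-maven-plugin</artifactId>
  <configuration>
    <protocArtifact>com.google.protobuf:protoc:${protobuf.version}:exe:osx-x86_64</protocArtifact>
  </configuration>
</plugin>
{code}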



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli commented on pull request #970: PARQUET-2150: parquet-protobuf to compile on Mac M1

2022-06-17 Thread GitBox


shangxinli commented on PR #970:
URL: https://github.com/apache/parquet-mr/pull/970#issuecomment-1159351267

   @sunchao I see your change is to upgrade the protobuf version. Is that 
required to solve this problem? I don't see that in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555821#comment-17555821
 ] 

ASF GitHub Bot commented on PARQUET-2158:
-

sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900689594


##
parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:
##
@@ -20,8 +20,8 @@
 
 import org.apache.hadoop.fs.GlobPattern;
 
-import java.util.regex.Pattern;
-import java.util.regex.PatternSyntaxException;
+import com.google.re2j.Pattern;

Review Comment:
   I think this may not work for projects like Spark that use the Hadoop 
shaded client, since `GlobPattern.compiled` is relocated to 
`org.apache.hadoop.shaded.com.google.re2j.Pattern`.
   
   It might be easier to just remove the class, as it has been marked as 
deprecated since Parquet 1.8.0 (2015). It is also not used anywhere in the 
project.
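
To make the relocation concern concrete, here is a small hedged check one could run on a classpath that uses the shaded Hadoop client. The relocated class name is taken from the comment above; how the mismatch surfaces at runtime (the unshaded class simply being absent) is an assumption:

{code:java}
public class ShadedRe2jCheck {
  public static void main(String[] args) {
    // PathGlobPattern compiles against com.google.re2j.Pattern, but the
    // shaded Hadoop client only ships the relocated copy, so the unshaded
    // class is absent at runtime and linking PathGlobPattern would fail.
    try {
      Class.forName("com.google.re2j.Pattern");
      System.out.println("unshaded re2j present; the import would resolve");
    } catch (ClassNotFoundException e) {
      System.out.println("only the relocated class is available: "
          + "org.apache.hadoop.shaded.com.google.re2j.Pattern");
    }
  }
}
{code}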





> Upgrade Hadoop dependency to version 3.2.0
> --
>
> Key: PARQUET-2158
> URL: https://issues.apache.org/jira/browse/PARQUET-2158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> Parquet still builds against Hadoop 2.10. This is very out of date and does 
> not work with Java 11, let alone later releases.
> Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with 
> Java 11, and lines up with active work on HADOOP-18287, _Provide a shim 
> library for modern FS APIs_.
> This will significantly speed up access to columnar data, especially in 
> cloud stores.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555820#comment-17555820
 ] 

ASF GitHub Bot commented on PARQUET-2158:
-

sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900689594


##
parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:
##
@@ -20,8 +20,8 @@
 
 import org.apache.hadoop.fs.GlobPattern;
 
-import java.util.regex.Pattern;
-import java.util.regex.PatternSyntaxException;
+import com.google.re2j.Pattern;

Review Comment:
   I think this may not work for projects like Spark that use the Hadoop 
shaded client, since `GlobPattern.compiled` is relocated to 
`org.apache.hadoop.shaded.com.google.re2j.Pattern`.
   
   It might be easier to just remove the class, as it has been marked as 
deprecated since Parquet 1.8.0 (2015).





> Upgrade Hadoop dependency to version 3.2.0
> --
>
> Key: PARQUET-2158
> URL: https://issues.apache.org/jira/browse/PARQUET-2158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> Parquet still builds against Hadoop 2.10. This is very out of date and does 
> not work with Java 11, let alone later releases.
> Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with 
> Java 11, and lines up with active work on HADOOP-18287, _Provide a shim 
> library for modern FS APIs_.
> This will significantly speed up access to columnar data, especially in 
> cloud stores.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] sunchao commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread GitBox


sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900689594


##
parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:
##
@@ -20,8 +20,8 @@
 
 import org.apache.hadoop.fs.GlobPattern;
 
-import java.util.regex.Pattern;
-import java.util.regex.PatternSyntaxException;
+import com.google.re2j.Pattern;

Review Comment:
   I think this may not work for projects like Spark that use the Hadoop 
shaded client, since `GlobPattern.compiled` is relocated to 
`org.apache.hadoop.shaded.com.google.re2j.Pattern`.
   
   It might be easier to just remove the class, as it has been marked as 
deprecated since Parquet 1.8.0 (2015).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread GitBox


sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900689594


##
parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:
##
@@ -20,8 +20,8 @@
 
 import org.apache.hadoop.fs.GlobPattern;
 
-import java.util.regex.Pattern;
-import java.util.regex.PatternSyntaxException;
+import com.google.re2j.Pattern;

Review Comment:
   I think this may not work for projects like Spark that use the Hadoop 
shaded client, since `GlobPattern.compiled` is relocated to 
`org.apache.hadoop.shaded.com.google.re2j.Pattern`.
   
   It might be easier to just remove the class, as it has been marked as 
deprecated since Parquet 1.8.0 (2015). It is also not used anywhere in the 
project.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2157) Add BloomFilter fpp config

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555817#comment-17555817
 ] 

ASF GitHub Bot commented on PARQUET-2157:
-

shangxinli commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1159349333

   LGTM




> Add BloomFilter fpp config
> --
>
> Key: PARQUET-2157
> URL: https://issues.apache.org/jira/browse/PARQUET-2157
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Huaxin Gao
>Priority: Major
>
> Currently parquet-mr hardcodes the bloom filter fpp (false positive 
> probability) to 0.01. We should have a config that lets the user specify 
> the fpp.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (PARQUET-2157) Add BloomFilter fpp config

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555818#comment-17555818
 ] 

ASF GitHub Bot commented on PARQUET-2157:
-

shangxinli merged PR #975:
URL: https://github.com/apache/parquet-mr/pull/975




> Add BloomFilter fpp config
> --
>
> Key: PARQUET-2157
> URL: https://issues.apache.org/jira/browse/PARQUET-2157
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Huaxin Gao
>Priority: Major
>
> Currently parquet-mr hardcodes the bloom filter fpp (false positive 
> probability) to 0.01. We should have a config that lets the user specify 
> the fpp.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli merged pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-17 Thread GitBox


shangxinli merged PR #975:
URL: https://github.com/apache/parquet-mr/pull/975


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-17 Thread GitBox


shangxinli commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1159349333

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555816#comment-17555816
 ] 

ASF GitHub Bot commented on PARQUET-2158:
-

shangxinli commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900688843


##
pom.xml:
##
@@ -76,7 +76,7 @@
 2.13.2.2
 0.14.2
 shaded.parquet
-2.10.1
+3.2.0

Review Comment:
   +1 for the question 





> Upgrade Hadoop dependency to version 3.2.0
> --
>
> Key: PARQUET-2158
> URL: https://issues.apache.org/jira/browse/PARQUET-2158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> Parquet still builds against Hadoop 2.10. This is very out of date and does 
> not work with Java 11, let alone later releases.
> Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with 
> Java 11, and lines up with active work on HADOOP-18287, _Provide a shim 
> library for modern FS APIs_.
> This will significantly speed up access to columnar data, especially in 
> cloud stores.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread GitBox


shangxinli commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900688843


##
pom.xml:
##
@@ -76,7 +76,7 @@
 2.13.2.2
 0.14.2
 shaded.parquet
-2.10.1
+3.2.0

Review Comment:
   +1 for the question 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2138) Add ShowBloomFilterCommand to parquet-cli

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555814#comment-17555814
 ] 

ASF GitHub Bot commented on PARQUET-2138:
-

shangxinli commented on PR #958:
URL: https://github.com/apache/parquet-mr/pull/958#issuecomment-1159347960

   @WangGuangxin Do you still plan to implement the decryption? I don't want to 
place a blocker if you don't have a plan for it.




> Add ShowBloomFilterCommand to parquet-cli
> -
>
> Key: PARQUET-2138
> URL: https://issues.apache.org/jira/browse/PARQUET-2138
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: EdisonWang
>Priority: Minor
>
> Add ShowBloomFilterCommand to parquet-cli, which can check whether given 
> values of a column match the bloom filter.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli commented on pull request #958: PARQUET-2138: Add ShowBloomFilterCommand to parquet-cli

2022-06-17 Thread GitBox


shangxinli commented on PR #958:
URL: https://github.com/apache/parquet-mr/pull/958#issuecomment-1159347960

   @WangGuangxin Do you still plan to implement the decryption? I don't want to 
place a blocker if you don't have a plan for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555813#comment-17555813
 ] 

ASF GitHub Bot commented on PARQUET-2069:
-

shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900688194


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {

Review Comment:
   I understand the targeted issue can be solved by this retry with a 
converted schema. But I am not sure if it is safe to just ignore the Avro 
schema in case of an exception. @rdblue @wesm Do you have some time to have a 
look at this?



##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {

Review Comment:
   I understand the target issue can be solved by this retry with a converted 
schema. But I am not sure if it is safe to just ignore the Avro schema in case 
of an exception. @rdblue @wesm Do you have some time to have a look at this?





> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -
>
> Key: PARQUET-2069
> URL: https://issues.apache.org/jira/browse/PARQUET-2069
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0
> Environment: Windows 10
>Reporter: Devon Kozenieski
>Priority: Blocker
> Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> In the attached files, there is one original file and one modified file that 
> results from reading the original file and writing it back with Parquet-MR, 
> with a few values modified. The schema should not be modified, since the 
> schema of the input file is used as the schema to write the output file. 
> However, the output file has a slightly modified schema that then cannot be 
> read back the same way again with Parquet-MR, resulting in the exception 
> message: java.lang.ClassCastException: optional binary element (STRING) is 
> not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-17 Thread GitBox


shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900688194


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {

Review Comment:
   I understand the targeted issue can be solved by this retry with a 
converted schema. But I am not sure if it is safe to just ignore the Avro 
schema in case of an exception. @rdblue @wesm Do you have some time to have a 
look at this?



##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {

Review Comment:
   I understand the target issue can be solved by this retry with a converted 
schema. But I am not sure if it is safe to just ignore the Avro schema in case 
of an exception. @rdblue @wesm Do you have some time to have a look at this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555811#comment-17555811
 ] 

ASF GitHub Bot commented on PARQUET-2069:
-

shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900687324


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {
+  System.err.println("Warning, Avro schema doesn't match Parquet schema, 
falling back to conversion: " + e.toString());

Review Comment:
   Any reason we don't use Log4j? 
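
For reference, Parquet modules generally log through SLF4J; a minimal sketch of what the fallback warning could look like with a logger instead of System.err (the class and message wording here are illustrative, not the PR's final code):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: route the fallback warning through SLF4J so it honours
// the application's logging configuration instead of writing to stderr.
class AvroReadSupportLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(AvroReadSupportLoggingSketch.class);

  static void warnFallback(RuntimeException e) {
    LOG.warn("Avro schema doesn't match Parquet schema, falling back to conversion", e);
  }
}
{code}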





> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -
>
> Key: PARQUET-2069
> URL: https://issues.apache.org/jira/browse/PARQUET-2069
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0
> Environment: Windows 10
>Reporter: Devon Kozenieski
>Priority: Blocker
> Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> In the attached files, there is one original file and one modified file that 
> results from reading the original file and writing it back with Parquet-MR, 
> with a few values modified. The schema should not be modified, since the 
> schema of the input file is used as the schema to write the output file. 
> However, the output file has a slightly modified schema that then cannot be 
> read back the same way again with Parquet-MR, resulting in the exception 
> message: java.lang.ClassCastException: optional binary element (STRING) is 
> not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-17 Thread GitBox


shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900687324


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {
+  System.err.println("Warning, Avro schema doesn't match Parquet schema, 
falling back to conversion: " + e.toString());

Review Comment:
   Any reason we don't use Log4j? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2069) Parquet file containing arrays, written by Parquet-MR, cannot be read again by Parquet-MR

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555810#comment-17555810
 ] 

ASF GitHub Bot commented on PARQUET-2069:
-

shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900687175


##
parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayListCompatibility.java:
##
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.avro;
+
+import com.google.common.io.Resources;
+import org.apache.avro.generic.GenericData;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.Test;
+import java.io.IOException;
+
+public class TestArrayListCompatibility {
+
+  @Test
+  public void testListArrayCompatibility() throws IOException {
+Path testPath = new 
Path(Resources.getResource("list-array-compat.parquet").getFile());
+
+Configuration conf = new Configuration();
+ParquetReader parquetReader =
+  AvroParquetReader.builder(testPath).withConf(conf).build();
+GenericData.Record firstRecord;
+try {
+  firstRecord = (GenericData.Record) parquetReader.read();
+} catch (Exception x) {
+  x.printStackTrace();

Review Comment:
   I think if you don't catch, it would still print out the stack. 
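
For comparison, a hedged sketch of the simpler shape the reviewer is hinting at, inside the same test class as above: let the exception propagate, and JUnit will fail the test and print the stack trace itself.

{code:java}
  @Test
  public void testListArrayCompatibility() throws IOException {
    Path testPath =
        new Path(Resources.getResource("list-array-compat.parquet").getFile());
    Configuration conf = new Configuration();
    ParquetReader parquetReader =
        AvroParquetReader.builder(testPath).withConf(conf).build();
    // No try/catch: if read() throws, the test fails and the stack trace
    // is reported by the test runner.
    GenericData.Record firstRecord = (GenericData.Record) parquetReader.read();
  }
{code}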





> Parquet file containing arrays, written by Parquet-MR, cannot be read again 
> by Parquet-MR
> -
>
> Key: PARQUET-2069
> URL: https://issues.apache.org/jira/browse/PARQUET-2069
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0
> Environment: Windows 10
>Reporter: Devon Kozenieski
>Priority: Blocker
> Attachments: modified.parquet, original.parquet, parquet-diff.png
>
>
> In the attached files, there is one original file and one modified file that 
> results from reading the original file and writing it back with Parquet-MR, 
> with a few values modified. The schema should not be modified, since the 
> schema of the input file is used as the schema to write the output file. 
> However, the output file has a slightly modified schema that then cannot be 
> read back the same way again with Parquet-MR, resulting in the exception 
> message: java.lang.ClassCastException: optional binary element (STRING) is 
> not a group
> My guess is that the issue lies in the Avro schema conversion.
> The Parquet files attached have some arrays and some nested fields.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-17 Thread GitBox


shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900687175


##
parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayListCompatibility.java:
##
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.avro;
+
+import com.google.common.io.Resources;
+import org.apache.avro.generic.GenericData;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.Test;
+import java.io.IOException;
+
+public class TestArrayListCompatibility {
+
+  @Test
+  public void testListArrayCompatibility() throws IOException {
+Path testPath = new 
Path(Resources.getResource("list-array-compat.parquet").getFile());
+
+Configuration conf = new Configuration();
+ParquetReader parquetReader =
+  AvroParquetReader.builder(testPath).withConf(conf).build();
+GenericData.Record firstRecord;
+try {
+  firstRecord = (GenericData.Record) parquetReader.read();
+} catch (Exception x) {
+  x.printStackTrace();

Review Comment:
   I think if you don't catch, it would still print out the stack. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2134) Incorrect type checking in HadoopStreams.wrap

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555809#comment-17555809
 ] 

ASF GitHub Bot commented on PARQUET-2134:
-

shangxinli commented on code in PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#discussion_r900676217


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -50,51 +46,45 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
-byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
-  try {
-return h2SeekableConstructor.newInstance(stream);
-  } catch (InstantiationException | IllegalAccessException e) {
-LOG.warn("Could not instantiate H2SeekableInputStream, falling back to 
byte array reads", e);
-return new H1SeekableInputStream(stream);
-  } catch (InvocationTargetException e) {
-throw new ParquetDecodingException(
-"Could not instantiate H2SeekableInputStream", 
e.getTargetException());
-  }
+if (isWrappedStreamByteBufferReadable(stream)) {
+  return new H2SeekableInputStream(stream);
 } else {
   return new H1SeekableInputStream(stream);
 }
   }
 
-  private static Class getReadableClass() {
-try {
-  return Class.forName("org.apache.hadoop.fs.ByteBufferReadable");
-} catch (ClassNotFoundException | NoClassDefFoundError e) {
-  return null;
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is "the stream is not FSDataInputStream
+   * and implements ByteBufferReadable".
+   *
+   * That is: all streams which implement ByteBufferReadable
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients in the hadoop codebase.
+   *
+   * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
+   * check this: only those streams which provide the read(ByteBuffer)
+   * semantics MAY return true for the probe "in:readbytebuffer";
+   * FSDataInputStream will pass the probe down to the underlying stream.
+   *
+   * @param stream stream to probe
+   * @return true if it is safe to use an H2SeekableInputStream to access the data
+   */
+  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+if (stream.hasCapability("in:readbytebuffer")) {

Review Comment:
   We don't have Hadoop 3.3.0 yet in Parquet. Does it mean we need to hold 
off this PR?





> Incorrect type checking in HadoopStreams.wrap
> -
>
> Key: PARQUET-2134
> URL: https://issues.apache.org/jira/browse/PARQUET-2134
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.3, 1.10.1, 1.11.2, 1.12.2
>Reporter: Todd Gao
>Priority: Minor
>
> The method 
> [HadoopStreams.wrap|https://github.com/apache/parquet-mr/blob/4d062dc37577e719dcecc666f8e837843e44a9be/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L51]
>  wraps an FSDataInputStream in a SeekableInputStream.
> It checks whether the underlying stream of the passed FSDataInputStream 
> implements ByteBufferReadable: if true, it wraps the FSDataInputStream in an 
> H2SeekableInputStream; otherwise, in an H1SeekableInputStream.
> In some cases, we may add another wrapper over FSDataInputStream. For 
> example,
> {code:java}
> class CustomDataInputStream extends FSDataInputStream {
>     public CustomDataInputStream(FSDataInputStream original) {
>         super(original);
>     }
> }
> {code}
> Suppose we create an FSDataInputStream whose underlying stream does not 
> implement ByteBufferReadable, and then create a CustomDataInputStream with 
> it. If we use HadoopStreams.wrap to create a SeekableInputStream, we may get 
> an error like 
> {quote}java.lang.UnsupportedOperationException: Byte-buffer read unsupported 
> by input stream{quote}
> We can fix this by recursively checking the underlying stream of the 
> FSDataInputStream.
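
A hedged sketch of the recursive check described in the last sentence (an illustration of the idea, not necessarily the code in PR #951):

{code:java}
import java.io.InputStream;
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

final class ByteBufferReadableCheck {
  // Unwrap nested FSDataInputStream layers until the innermost stream is
  // reached, then test whether that stream supports read(ByteBuffer).
  static boolean isByteBufferReadable(FSDataInputStream stream) {
    InputStream wrapped = stream.getWrappedStream();
    if (wrapped instanceof FSDataInputStream) {
      return isByteBufferReadable((FSDataInputStream) wrapped);
    }
    return wrapped instanceof ByteBufferReadable;
  }
}
{code}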



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-17 Thread GitBox


shangxinli commented on code in PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#discussion_r900676217


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -50,51 +46,45 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
-byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
-  try {
-return h2SeekableConstructor.newInstance(stream);
-  } catch (InstantiationException | IllegalAccessException e) {
-LOG.warn("Could not instantiate H2SeekableInputStream, falling back to 
byte array reads", e);
-return new H1SeekableInputStream(stream);
-  } catch (InvocationTargetException e) {
-throw new ParquetDecodingException(
-"Could not instantiate H2SeekableInputStream", 
e.getTargetException());
-  }
+if (isWrappedStreamByteBufferReadable(stream)) {
+  return new H2SeekableInputStream(stream);
 } else {
   return new H1SeekableInputStream(stream);
 }
   }
 
-  private static Class getReadableClass() {
-try {
-  return Class.forName("org.apache.hadoop.fs.ByteBufferReadable");
-} catch (ClassNotFoundException | NoClassDefFoundError e) {
-  return null;
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is "the stream is not FSDataInputStream
+   * and implements ByteBufferReadable".
+   *
+   * That is: all streams which implement ByteBufferReadable
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients in the hadoop codebase.
+   *
+   * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
+   * check this: only those streams which provide the read(ByteBuffer)
+   * semantics MAY return true for the probe "in:readbytebuffer";
+   * FSDataInputStream will pass the probe down to the underlying stream.
+   *
+   * @param stream stream to probe
+   * @return true if it is safe to use an H2SeekableInputStream to access the data
+   */
+  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+if (stream.hasCapability("in:readbytebuffer")) {

Review Comment:
   We don't have Hadoop 3.3.0 yet in Parquet. Does it mean we need to hold 
off this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555790#comment-17555790
 ] 

ASF GitHub Bot commented on PARQUET-2158:
-

sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900576817


##
pom.xml:
##
@@ -76,7 +76,7 @@
 2.13.2.2
 0.14.2
 shaded.parquet
-2.10.1
+3.2.0

Review Comment:
   hmm why 3.2.0, not 3.3.1/3.3.2?





> Upgrade Hadoop dependency to version 3.2.0
> --
>
> Key: PARQUET-2158
> URL: https://issues.apache.org/jira/browse/PARQUET-2158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> Parquet still builds against Hadoop 2.10. This is very out of date and does 
> not work with Java 11, let alone later releases.
> Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with 
> Java 11, and lines up with active work on HADOOP-18287, _Provide a shim 
> library for modern FS APIs_.
> This will significantly speed up access to columnar data, especially in 
> cloud stores.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [parquet-mr] sunchao commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread GitBox


sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900576817


##
pom.xml:
##
@@ -76,7 +76,7 @@
 2.13.2.2
 0.14.2
 shaded.parquet
-2.10.1
+3.2.0

Review Comment:
   hmm why 3.2.0, not 3.3.1/3.3.2?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org