[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492997#comment-17492997
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 edited a comment on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205


   @shangxinli Thanks a lot for the review. I have addressed most of the 
   comments.
   
   > We need more test to cover old parquet data that doesn't have column index.
   
   I couldn't find any existing tests, or any existing Parquet files in the 
   resources directory, that lack column indexes. Could you please point me to 
   a similar existing test, or to a way of creating a Parquet file without 
   column indexes? (I don't see any option to disable writing column indexes 
   either.)
   I have added tests that validate the row indexes with "column index 
   filtering" disabled.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a 
> Parquet file either in columnar fashion or record by record.
> It would be great to extend them to also support a rowPosition API that 
> reports the position of the current record in the Parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This 
> can be useful for building an index (e.g. a B+ tree) over a Parquet file or 
> Parquet table (e.g. in Spark/Hive).
> Several projects in the Parquet ecosystem can benefit from this 
> functionality: 
>  # Apache Iceberg needs this functionality. It already has its own 
> implementation because it relies on low-level Parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
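
The index use case mentioned in the issue description can be illustrated with a minimal sketch. It assumes records are visited in file order, so a running counter is the row position, and it uses `java.util.TreeMap` as a stand-in for a B+ tree. The class `RowPositionIndex` and its methods are hypothetical names for illustration only and are not part of parquet-mr.

```java
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: a sorted index from a key column to row positions. */
final class RowPositionIndex {
  // key -> 0-based position of the record in the file (first occurrence wins)
  private final TreeMap<String, Long> index = new TreeMap<>();

  /** Assumes the keys are supplied in file order, one per record. */
  static RowPositionIndex build(List<String> keysInFileOrder) {
    RowPositionIndex idx = new RowPositionIndex();
    long rowPosition = 0L;
    for (String key : keysInFileOrder) {
      idx.index.putIfAbsent(key, rowPosition);
      rowPosition++;
    }
    return idx;
  }

  /** Returns the row position for the key, or -1 if the key is absent. */
  long lookup(String key) {
    Long pos = index.get(key);
    return pos == null ? -1L : pos;
  }
}
```

In a real reader the counter would be replaced by whatever row position the reader reports for the current record; the sketch only shows why a stable per-file position is enough to build such an index.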


[GitHub] [parquet-mr] prakharjain09 edited a comment on pull request #945: PARQUET-2117: Expose Row Index via ParquetReader and ParquetRecordReader

2022-02-15 Thread GitBox


prakharjain09 edited a comment on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205


   @shangxinli Thanks a lot for the review. I have addressed most of the 
   comments.
   
   > We need more test to cover old parquet data that doesn't have column index.
   
   I couldn't find any existing tests, or any existing Parquet files in the 
   resources directory, that lack column indexes. Could you please point me to 
   a similar existing test, or to a way of creating a Parquet file without 
   column indexes? (I don't see any option to disable writing column indexes 
   either.)
   I have added tests that validate the row indexes with "column index 
   filtering" disabled.
   






[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492994#comment-17492994
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205


   @shangxinli Thanks a lot for the review.
   
   > We need more test to cover old parquet data that doesn't have column index.
   
   I couldn't find any existing tests, or any existing Parquet files in the 
   resources directory, that lack column indexes. Could you please point me to 
   a similar existing test, or to a way of creating a Parquet file without 
   column indexes? (I don't see any option to disable writing column indexes 
   either.)
   I have added tests that validate the row indexes with "column index 
   filtering" disabled.
   


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492983#comment-17492983
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498522



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder 
builder, List use
 }
   }
 
-  private static ParquetReader createReader(Path file, Filter filter) 
throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) 
throws IOException {

Review comment:
   This is being used by the new test file - 
https://github.com/apache/parquet-mr/pull/945/files#diff-276ac02899424a4245b8589f8ee6d444e3619f6be17834e4d5d7e81dfdaaee39R129
   We use this to create a reader over the test parquet file so that we can 
call the new ParquetReader.getRowIndex API for unit testing.




[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492982#comment-17492982
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498522



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder 
builder, List use
 }
   }
 
-  private static ParquetReader createReader(Path file, Filter filter) 
throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) 
throws IOException {

Review comment:
   This is being used by the new test file - 
https://github.com/apache/parquet-mr/pull/945/files#diff-276ac02899424a4245b8589f8ee6d444e3619f6be17834e4d5d7e81dfdaaee39R129




[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492981#comment-17492981
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498154



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -340,12 +342,21 @@ public static void write(ParquetWriter.Builder 
builder, List use
 return users;
   }
 
-  public static List readUsers(ParquetReader.Builder builder) 
throws IOException {

Review comment:
   fixed - not deleting it.




[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492979#comment-17492979
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807497937



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, 
InterruptedException {
 return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+if (current == 0L) {

Review comment:
   `current` is an existing variable that tracks the number of rows already 
   processed. It is initialized to 0 at declaration, so if it is still 0 here, 
   no row has been processed yet.
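
   The guard described in this comment can be sketched in isolation. This is 
   a minimal stand-in, not the parquet-mr class: `RowIndexGuardSketch` and 
   `processRow` are hypothetical names, and `IllegalStateException` is an 
   assumption used in place of the exception type introduced by the patch.

```java
/** Hypothetical stand-in for the reader state discussed above. */
final class RowIndexGuardSketch {
  private long current = 0;            // number of rows already processed
  private long currentRowIndex = -1L;  // index of the most recently processed row

  /** Simulates processing one row located at rowIndexInFile. */
  void processRow(long rowIndexInFile) {
    currentRowIndex = rowIndexInFile;
    current++;
  }

  /** Returns the index of the current row; fails if no row was processed. */
  long getCurrentRowIndex() {
    if (current == 0L) {
      // Assumption: the patch throws its own exception type here.
      throw new IllegalStateException(
          "row index can be fetched only after processing a row");
    }
    return currentRowIndex;
  }
}
```

   Checking `current` rather than `currentRowIndex == -1` keeps the "nothing 
   read yet" condition tied to the counter the reader already maintains.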




[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492978#comment-17492978
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496854



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
   renamed to `rowIdxInFileItr`.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
   renamed to shorter name.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, 
InterruptedException {
 return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+if (current == 0L) {
+  throw new RowIndexFetchedWithoutProcessingRowException("row index can be 
fetched only after processing a row");
+}
+if (rowIndexWithinFileIterator == null) {
+  throw new RowIndexNotSupportedException("underlying page read store 
implementation" +
+" doesn't support row index generation");
+}
+return currentRowIndex;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+Optional rowIndexOffsetForCurrentRowGroup = 
pages.getRowIndexOffset();

Review comment:
   renamed to shorter name.
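
   The per-row-group reset in the hunk above can be sketched as follows, 
   under simplifying assumptions: every row in the row group is read (no 
   row-range filtering), and the row count is passed in explicitly. 
   `RowIndexIteratorSketch`, `advance`, and the use of 
   `UnsupportedOperationException` are hypothetical choices for this example; 
   only the field name `rowIdxInFileItr` comes from the review comment.

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

final class RowIndexIteratorSketch {
  private PrimitiveIterator.OfLong rowIdxInFileItr; // null if offsets are unavailable
  private long currentRowIndex = -1L;

  /** Called when a new row group is loaded. */
  void resetRowIndexIterator(Optional<Long> rowIndexOffset, long rowCount) {
    if (!rowIndexOffset.isPresent()) {
      rowIdxInFileItr = null; // the page store cannot report its offset
      return;
    }
    long start = rowIndexOffset.get();
    // Row indexes of this row group: start, start + 1, ..., start + rowCount - 1
    rowIdxInFileItr = LongStream.range(start, start + rowCount).iterator();
  }

  /** Called once per record; advances the row index when supported. */
  void advance() {
    if (rowIdxInFileItr != null && rowIdxInFileItr.hasNext()) {
      currentRowIndex = rowIdxInFileItr.nextLong();
    }
  }

  long getCurrentRowIndex() {
    if (rowIdxInFileItr == null) {
      throw new UnsupportedOperationException(
          "underlying page read store doesn't support row index generation");
    }
    return currentRowIndex;
  }
}
```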




[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492977#comment-17492977
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496556



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
##
@@ -248,15 +248,18 @@ public DictionaryPage readDictionaryPage() {
 
   private final Map readers = new 
HashMap();
   private final long rowCount;
+  private final long rowIndexOffset;
   private final RowRanges rowRanges;
 
-  public ColumnChunkPageReadStore(long rowCount) {
+  public ColumnChunkPageReadStore(long rowCount, long rowIndexOffset) {

Review comment:
   Makes sense - retaining the older constructor.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
##
@@ -265,6 +268,11 @@ public long getRowCount() {
 return rowCount;
   }
 
+  @Override
+  public Optional getRowIndexOffset() {
+return Optional.of(rowIndexOffset);

Review comment:
   done.
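
   One way to retain the older constructor while exposing an optional offset 
   is sketched below. The types `PageStoreSketch` and `ChunkStoreSketch` are 
   hypothetical, and the negative sentinel for "offset unknown" is an 
   assumption made for this example rather than something confirmed by the 
   thread.

```java
import java.util.Optional;

/** Hypothetical interface: stores that cannot report an offset return empty. */
interface PageStoreSketch {
  long getRowCount();

  default Optional<Long> getRowIndexOffset() {
    return Optional.empty();
  }
}

final class ChunkStoreSketch implements PageStoreSketch {
  private final long rowCount;
  private final long rowIndexOffset; // negative means "unknown" (assumption)

  /** Older constructor kept so that existing callers still compile. */
  ChunkStoreSketch(long rowCount) {
    this(rowCount, -1L);
  }

  ChunkStoreSketch(long rowCount, long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  @Override
  public long getRowCount() {
    return rowCount;
  }

  @Override
  public Optional<Long> getRowIndexOffset() {
    return rowIndexOffset < 0 ? Optional.empty() : Optional.of(rowIndexOffset);
  }
}
```

   Returning `Optional.empty()` for the unknown case lets a caller distinguish 
   "offset 0" from "no offset available" without a magic number leaking out.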




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader expose APIs to read 
> a parquet file in columnar fashion or record-by-record.
> It would be great to extend them to also support a rowPosition API which can 
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This 
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet 
> table (e.g. Spark/Hive).
> There are multiple projects in the parquet ecosystem which can benefit from 
> such functionality: 
>  # Apache Iceberg needs this functionality. It has this implementation 
> already as it relies on low level parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
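
To make the idea in the description concrete, here is a small sketch of how a file-level row position could be derived, assuming the reader knows the index of the first row of the current row group and returns rows sequentially without skipping any. `RowPositionSketch` is a hypothetical helper, not a parquet-mr class:

```java
// Hypothetical helper; illustration only, not part of parquet-mr.
final class RowPositionSketch {
  // Index, within the whole file, of the first row of the current row group.
  private final long rowIndexOffset;
  // Number of rows already returned from the current row group.
  private long rowsReadInGroup = 0;

  RowPositionSketch(long rowIndexOffset) {
    this.rowIndexOffset = rowIndexOffset;
  }

  // Returns the file-level position of the next row and advances by one.
  long nextRowPosition() {
    return rowIndexOffset + rowsReadInGroup++;
  }
}
```

For a row group that starts at row 150, successive calls yield 150, 151, 152, and so on. If rows can be skipped (for example by page-level filtering), a plain counter like this would no longer be enough.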



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492975#comment-17492975
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496428



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final 
InputStream from, MetadataFilte
 return readParquetMetadata(from, filter, null, false, 0);
   }
 
+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map<RowGroup, Long> rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map<RowGroup, Long> rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
+
   public ParquetMetadata readParquetMetadata(final InputStream from, 
MetadataFilter filter,
   final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter,
   final int combinedFooterLength) throws IOException {
 
 final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? 
fileDecryptor.fetchFooterDecryptor() : null);
 final byte[] encryptedFooterAAD = (encryptedFooter? 
AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
 
-FileMetaData fileMetaData = filter.accept(new 
MetadataFilterVisitor() {
+FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = 
filter.accept(new MetadataFilterVisitor() {

Review comment:
   `visit(OffsetMetadataFilter filter)` and `visit(RangeMetadataFilter 
filter)` return the "filtered" FileMetaData, so some of the row groups might 
be missing from it. Computing the offsets at the end, from that filtered 
metadata, would therefore generate an incorrect rowIndexOffset.
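
The point can be illustrated with a small sketch. It assumes a simplified, hypothetical `SketchRowGroup` type in place of the Thrift `RowGroup` and uses the same running-sum idea as `generateRowGroupOffsets` in the diff above; the names and row counts are made up for illustration:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a row group; only the row count matters here.
final class SketchRowGroup {
  final String name;
  final long numRows;

  SketchRowGroup(String name, long numRows) {
    this.name = name;
    this.numRows = numRows;
  }
}

final class RowIndexOffsetSketch {
  // Running sum of row counts: each row group maps to the index of its first row.
  static Map<String, Long> offsets(List<SketchRowGroup> rowGroups) {
    Map<String, Long> result = new LinkedHashMap<>();
    long rowIdxSum = 0;
    for (SketchRowGroup rg : rowGroups) {
      result.put(rg.name, rowIdxSum);
      rowIdxSum += rg.numRows;
    }
    return result;
  }

  public static void main(String[] args) {
    List<SketchRowGroup> all = Arrays.asList(
        new SketchRowGroup("rg0", 100),
        new SketchRowGroup("rg1", 50),
        new SketchRowGroup("rg2", 70));
    // Pretend a metadata filter dropped rg1.
    List<SketchRowGroup> filtered = Arrays.asList(all.get(0), all.get(2));

    // Computed from the full list: rg2 starts at row 150.
    System.out.println(offsets(all).get("rg2"));
    // Computed after filtering: rg2 appears to start at row 100, which is wrong.
    System.out.println(offsets(filtered).get("rg2"));
  }
}
```

Computing the map from the unfiltered list and only then applying the filter keeps each surviving row group's offset tied to its true position in the file.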






[GitHub] [parquet-mr] prakharjain09 commented on a change in pull request #945: PARQUET-2117: Expose Row Index via ParquetReader and ParquetRecordReader

2022-02-15 Thread GitBox


prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496556







[GitHub] [parquet-mr] prakharjain09 commented on a change in pull request #945: PARQUET-2117: Expose Row Index via ParquetReader and ParquetRecordReader

2022-02-15 Thread GitBox


prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496428







[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492974#comment-17492974
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807495498



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final 
InputStream from, MetadataFilte
 return readParquetMetadata(from, filter, null, false, 0);
   }
 
+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;

Review comment:
   done.






[GitHub] [parquet-mr] prakharjain09 commented on a change in pull request #945: PARQUET-2117: Expose Row Index via ParquetReader and ParquetRecordReader

2022-02-15 Thread GitBox


prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807495498







Updated invitation: Parquet Sync @ Tue Mar 1, 2022 9am - 10am (PST) (dev@parquet.apache.org)

2022-02-15 Thread shangx
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T02
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T02
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20220301T09
DTEND;TZID=America/Los_Angeles:20220301T10
DTSTAMP:20220215T210217Z
ORGANIZER;CN=sha...@uber.com:mailto:sha...@uber.com
UID:tvfaq819qc81p0ljk2jkpvhc9n_r20220223t170...@google.com
ATTENDEE;CUTYPE=RESOURCE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TR
 UE;CN=SEA | 1191 2nd Ave-8th-Blakely (7) [Zoom];X-NUM-GUESTS=0:mailto:uber.
 com_53454131313931326e6441766530387468426c616b656c793756432d343836313237@re
 source.calendar.google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=sha...@uber.com;X-NUM-GUESTS=0:mailto:sha...@uber.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Matthew Turner;X-NUM-GUESTS=0:mailto:matthew.m.tur...@outlook.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=wesmck...@gmail.com;X-NUM-GUESTS=0:mailto:wesmck...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gabor.szadovs...@cloudera.com;X-NUM-GUESTS=0:mailto:gabor.szadovszk
 y...@cloudera.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gg5...@gmail.com;X-NUM-GUESTS=0:mailto:gg5...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=gwali...@gmail.com;X-NUM-GUESTS=0:mailto:gwali...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=emkornfi...@gmail.com;X-NUM-GUESTS=0:mailto:emkornfi...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN="Lekshmi Narayanan, Arun Balajiee";X-NUM-GUESTS=0:mailto:arl122@pit
 t.edu
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=altekruseja...@gmail.com;X-NUM-GUESTS=0:mailto:altekrusejason@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Ryan Blue;X-NUM-GUESTS=0:mailto:rb...@netflix.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=chao.apa...@gmail.com;X-NUM-GUESTS=0:mailto:chao.apa...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN="Wang, Yuming";X-NUM-GUESTS=0:mailto:yumw...@ebay.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=fo...@driesprongen.nl;X-NUM-GUESTS=0:mailto:fo...@driesprongen.nl
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=aniskodedoss...@etsy.com;X-NUM-GUESTS=0:mailto:aniskodedossett@etsy
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Ivan Gavryliuk;X-NUM-GUESTS=0:mailto:i...@isolineltd.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Brian Bowman;X-NUM-GUESTS=0:mailto:brian.bow...@sas.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=jiashenzz...@gmail.com;X-NUM-GUESTS=0:mailto:jiashenzz...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=vinoo.gan...@gmail.com;X-NUM-GUESTS=0:mailto:vinoo.gan...@gmail.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=hadrien.k...@sonat.no;X-NUM-GUESTS=0:mailto:hadrien.k...@sonat.no
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=py...@pinterest.com;X-NUM-GUESTS=0:mailto:py...@pinterest.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=jorgecarlei...@gmail.com;X-NUM-GUESTS=0:mailto:jorgecarleitao@gmail
 .com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=dev@parquet.apache.org;X-NUM-GUESTS=0:mailto:dev@parquet.apache.org
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Robert Kruszewski;X-NUM-GUESTS=0:mailto:robe...@palantir.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=Revin Chalil;X-NUM-GUESTS=0:mailto:revin.cha...@microsoft.com
X-MICROSOFT-CDO-OWNERAPPTID:-606989481
RECURRENCE-ID;TZID=America/Los_Angeles:20220223T09
CREATED:20200210T155820Z
DESCRIPTION:Xinli shang is inviting you to a scheduled Zoom meet
 ing.Join Zoom Meeting - password is requiredhttps://ub
 er.zoom.us/j/3523778975">https://uber.zoom.us/j/3523778975?pwd=anhscnBNbFpD
 aUprQkZ1T3RLRzBPQT09Meeting ID: 352 377 8975Password: 03011
 5One tap mobile+16699006833\,\,3523778975# US (San 

[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492822#comment-17492822
 ] 

ASF GitHub Bot commented on PARQUET-2120:
-

rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533


   Done!




> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" 
> is null
>   at 
> org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S   _      1      46.00 B   0      "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _ R      200    0.34 B    0      "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.
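
A minimal sketch of the kind of guard the description calls for, assuming simplified stand-in types. `DictionaryPageSketch` and `ShowDictionarySketch` are hypothetical, and this is not the actual change made in pull request #946:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a dictionary page; only the encoding is kept.
final class DictionaryPageSketch {
  final String encoding;

  DictionaryPageSketch(String encoding) {
    this.encoding = encoding;
  }
}

final class ShowDictionarySketch {
  // Each list element is the dictionary page of one row group, or null
  // when that row group's column chunk is not dictionary-encoded.
  static List<String> describe(List<DictionaryPageSketch> pagesPerRowGroup) {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < pagesPerRowGroup.size(); i++) {
      DictionaryPageSketch page = pagesPerRowGroup.get(i);
      if (page == null) {
        // Report and move on instead of dereferencing null.
        lines.add("Row group " + i + ": no dictionary page");
        continue;
      }
      lines.add("Row group " + i + ": dictionary encoding " + page.encoding);
    }
    return lines;
  }
}
```

Skipping the null case lets the loop continue to later row groups that do carry a dictionary, which is the behavior the report says is missing.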





[GitHub] [parquet-mr] rshkv commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

2022-02-15 Thread GitBox


rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533






[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492819#comment-17492819
 ] 

ASF GitHub Bot commented on PARQUET-2121:
-

sekikn commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806426714



##
File path: README.md
##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
   Thank you for the comment, @shangxinli! I updated the PR.
   Instead of removing the lines, I just added '(deprecated)' to 
parquet-cascading* and parquet-scrooge in README.md, just like parquet-hive.






> Remove descriptions for the removed modules
> ---
>
> Key: PARQUET-2121
> URL: https://issues.apache.org/jira/browse/PARQUET-2121
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>
> PARQUET-2020 removed some deprecated modules, but the related descriptions 
> still remain in some documents. They should be removed since their existence 
> is misleading.





[GitHub] [parquet-mr] sekikn commented on a change in pull request #947: PARQUET-2121: Remove descriptions for the removed modules

2022-02-15 Thread GitBox


sekikn commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806426714







[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492809#comment-17492809
 ] 

ASF GitHub Bot commented on PARQUET-2121:
-

shangxinli commented on pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#issuecomment-1039353753


   @sekikn Thanks for working on it! I just left some minor comments. 




[GitHub] [parquet-mr] shangxinli commented on pull request #947: PARQUET-2121: Remove descriptions for the removed modules

2022-02-15 Thread GitBox


shangxinli commented on pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#issuecomment-1039353753






[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492807#comment-17492807
 ] 

ASF GitHub Bot commented on PARQUET-2121:
-

shangxinli commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806064746



##
File path: README.md
##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
   Since the code is still there, do you think we can just add 
'(deprecated)' here?






[GitHub] [parquet-mr] shangxinli commented on a change in pull request #947: PARQUET-2121: Remove descriptions for the removed modules

2022-02-15 Thread GitBox


shangxinli commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806064746







[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492797#comment-17492797
 ] 

ASF GitHub Bot commented on PARQUET-2120:
-

shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1039360003


   Thanks for working on it! Can you squash the commits?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
>   at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S   _      1      46.00 B   0      "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _ R      200    0.34 B    0      "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle the 
> case where it instead returns {{null}}.
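
The failure mode described above can be illustrated with a minimal Java sketch. It is not the change made in pull request #946 and it does not use parquet-mr classes: Page, RowGroup and describe are hypothetical stand-ins invented for this example. The only point it shows is that a null result from a readDictionaryPage-style call is checked and reported per row group, so later row groups that do have dictionaries are still reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT parquet-mr classes.
final class DictionarySketch {
    /** Stand-in for a dictionary page; it only carries an encoding name. */
    static final class Page {
        final String encoding;
        Page(String encoding) { this.encoding = encoding; }
    }

    /** Stand-in for a row group whose dictionary page may be absent. */
    static final class RowGroup {
        private final Page dictionaryPage; // null when there is no dictionary
        RowGroup(Page dictionaryPage) { this.dictionaryPage = dictionaryPage; }
        Page readDictionaryPage() { return dictionaryPage; }
    }

    /** Describes every row group, reporting a missing dictionary instead of throwing. */
    static List<String> describe(List<RowGroup> rowGroups) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < rowGroups.size(); i++) {
            Page page = rowGroups.get(i).readDictionaryPage();
            if (page == null) {
                // Without this check, page.encoding would throw a NullPointerException.
                lines.add("Row group " + i + ": no dictionary");
                continue;
            }
            lines.add("Row group " + i + ": dictionary encoding " + page.encoding);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Mirrors the file in the report: first row group without a dictionary, second with one.
        List<RowGroup> groups = Arrays.asList(
            new RowGroup(null),
            new RowGroup(new Page("PLAIN")));
        describe(groups).forEach(System.out::println);
    }
}
```

Whether a missing dictionary should be reported, skipped silently, or treated as an error is a design choice for the tool; the sketch picks reporting so that mixed files remain inspectable.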



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [parquet-mr] shangxinli commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

2022-02-15 Thread GitBox


shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1039360003


   Thanks for working on it! Can you squash the commits?






[jira] [Resolved] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding

2022-02-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2124.
-
Fix Version/s: cpp-8.0.0
   Resolution: Fixed

Issue resolved by pull request 12427
[https://github.com/apache/arrow/pull/12427]

> Bad DCHECK For Intermixed Dictionary Encoding
> -
>
> Key: PARQUET-2124
> URL: https://issues.apache.org/jira/browse/PARQUET-2124
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Parquet CPP has a DCHECK that rejects a dictionary encoded page coming after 
> a non-dictionary encoded page. This is bad because the DCHECK can be triggered 
> by Parquet files that have a column that has a dictionary page, then a 
> non-dictionary encoded page, then a page of dictionary encoded 
> values (indices). Fuzzing found such a file. While this could be turned into 
> an exception, I don't see anything in the Parquet specification that 
> prohibits such a sequence of pages.
> This situation has been brought up on the mailing list before 
> ([https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos]) 
> and it seems like this is valid but nobody is doing it.
> In the PR that added this check 
> ([https://github.com/apache/parquet-cpp/pull/73]) it was noted that the 
> check is probably not needed.
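
To make the page ordering in question concrete, here is a small Java sketch (Java rather than C++ only to stay in one language across this digest). It is not the Arrow/Parquet C++ reader: DataPage and decode are hypothetical names, and real pages carry encoded bytes rather than string arrays. It only shows that a decoder which keeps the dictionary for the whole column chunk can handle a plain page followed by a dictionary-indexed page without any ordering restriction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration only; not the Parquet C++ or parquet-mr reader.
final class MixedPagesSketch {
    /** A data page holding either plain values or indices into the dictionary. */
    static final class DataPage {
        final boolean dictionaryEncoded;
        final String[] plainValues; // set only when dictionaryEncoded is false
        final int[] indices;        // set only when dictionaryEncoded is true

        private DataPage(boolean dictionaryEncoded, String[] plainValues, int[] indices) {
            this.dictionaryEncoded = dictionaryEncoded;
            this.plainValues = plainValues;
            this.indices = indices;
        }
        static DataPage plain(String... values) { return new DataPage(false, values, null); }
        static DataPage indexed(int... indices) { return new DataPage(true, null, indices); }
    }

    /** Decodes pages in order; the dictionary stays available for every page. */
    static List<String> decode(String[] dictionary, List<DataPage> pages) {
        List<String> out = new ArrayList<>();
        for (DataPage page : pages) {
            if (page.dictionaryEncoded) {
                if (dictionary == null) {
                    // A recoverable error, unlike a debug-only assertion.
                    throw new IllegalStateException("dictionary encoded page without a dictionary");
                }
                for (int index : page.indices) {
                    out.add(dictionary[index]);
                }
            } else {
                out.addAll(Arrays.asList(page.plainValues));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Dictionary page, then a plain page, then a page of dictionary indices.
        String[] dictionary = {"b", "c"};
        List<DataPage> pages = Arrays.asList(
            DataPage.plain("a"),
            DataPage.indexed(0, 1, 0));
        System.out.println(decode(dictionary, pages));
    }
}
```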





[jira] [Resolved] (PARQUET-2123) Invalid memory access in ScanFileContents

2022-02-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2123.
-
Fix Version/s: cpp-8.0.0
   Resolution: Fixed

Issue resolved by pull request 12423
[https://github.com/apache/arrow/pull/12423]

> Invalid memory access in ScanFileContents
> -
>
> Key: PARQUET-2123
> URL: https://issues.apache.org/jira/browse/PARQUET-2123
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a Parquet file has 0 columns, ScanFileContents will try to access the 
> 0th element of a size 0 vector.





[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-15 Thread Ze'ev Maor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492593#comment-17492593
 ] 

Ze'ev Maor commented on PARQUET-2122:
-

[~junjie] thanks, that worked. It does seem odd, though, that a maximum Bloom 
filter size of 1 MB would actually result in 1 MB being used by a Bloom filter 
on a column with a cardinality of just 14, doesn't it?

> Adding Bloom filter to small Parquet file bloats in size X1700
> --
>
> Key: PARQUET-2122
> URL: https://issues.apache.org/jira/browse/PARQUET-2122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Ze'ev Maor
>Priority: Critical
> Attachments: data.csv, data_index_bloom.parquet
>
>
> Converting a small CSV file (14 rows, 1 string column) to Parquet without a 
> Bloom filter yields a 600 B file. Adding '.withBloomFilterEnabled(true)' to 
> ParquetWriter then yields a 1049197 B file.
> It isn't clear what the extra space is used for.
> The CSV and the bloated Parquet file are attached.
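
For a sense of scale, the sizes reported above can be compared with the textbook Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n distinct values and false-positive probability p. The Java sketch below is only that generic formula; it is not how parquet-mr sizes its Bloom filters, and the 1% false-positive rate is an assumption chosen for illustration. The class and method names are hypothetical.

```java
// Textbook Bloom filter sizing; NOT parquet-mr's sizing logic.
final class BloomSizeSketch {
    /** Optimal number of bits for n distinct values at false-positive probability p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** The same estimate rounded up to whole bytes. */
    static long optimalBytes(long n, double p) {
        return (optimalBits(n, p) + 7) / 8;
    }

    public static void main(String[] args) {
        long distinctValues = 14;          // column cardinality from the report
        double falsePositiveRate = 0.01;   // assumed for illustration
        long observedGrowth = 1049197L - 600L; // file sizes from the report

        System.out.println("textbook estimate: "
            + optimalBytes(distinctValues, falsePositiveRate) + " bytes");
        System.out.println("observed growth:   " + observedGrowth + " bytes");
    }
}
```

Under these assumptions the estimate is a few tens of bytes at most, while the observed growth of 1048597 bytes is almost exactly 1 MiB (1048576 bytes), which fits the observation in the comment that the filter appears to occupy its maximum size rather than a size derived from the number of distinct values.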





[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492564#comment-17492564
 ] 

ASF GitHub Bot commented on PARQUET-2120:
-

rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533


   Done!









[GitHub] [parquet-mr] rshkv commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

2022-02-15 Thread GitBox


rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533


   Done!

