[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492997#comment-17492997 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 edited a comment on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205

@shangxinli Thanks a lot for the review. I have addressed most of the comments.

> We need more test to cover old parquet data that doesn't have column index.

I couldn't find any existing tests, or any existing parquet files in the Resource directory, that don't have column indexes. Could you please give a pointer to a similar existing test, or to some way to create a parquet file without column indexes? (I don't see any option to disable writing column indexes either.)

I have added tests validating that row indexes are correct with "column index filtering" disabled.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a
> parquet file in columnar fashion or record-by-record.
> It would be great to extend them to also support a rowPosition API which can
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet
> table (e.g. Spark/Hive).
> There are multiple projects in the parquet ecosystem which can benefit from
> such functionality:
> # Apache Iceberg needs this functionality. It has this implementation
> already as it relies on low level parquet APIs -
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
> # Apache Spark can use this functionality - SPARK-37980

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492983#comment-17492983 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498522

## File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java ##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder builder, List use
     }
   }

-  private static ParquetReader createReader(Path file, Filter filter) throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) throws IOException {

Review comment:
This is used by the new test file - https://github.com/apache/parquet-mr/pull/945/files#diff-276ac02899424a4245b8589f8ee6d444e3619f6be17834e4d5d7e81dfdaaee39R129
We use it to create a reader over the test parquet file so that we can call the new ParquetReader.getRowIndex API for unit testing.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492981#comment-17492981 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498154

## File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java ##
@@ -340,12 +342,21 @@ public static void write(ParquetWriter.Builder builder, List use
     return users;
   }

-  public static List readUsers(ParquetReader.Builder builder) throws IOException {

Review comment:
Fixed - not deleting it.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492979#comment-17492979 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807497937

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {

Review comment:
`current` is an existing variable that tracks the number of rows already processed. It is initialized to 0 at declaration time, so if it is still 0 here, we haven't processed any row yet.
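The guard described in the review comment above can be sketched in isolation. This is a minimal illustration, not the parquet-mr code: the class `RowIndexTracker`, the helper `advance`, and the use of `IllegalStateException` are hypothetical stand-ins. Only the `current` and `currentRowIndex` fields and the rule "fetch the row index only after processing a row" come from the diff.

```java
/** Hypothetical stand-in for the reader state discussed above; not the parquet-mr class. */
final class RowIndexTracker {
  // Number of rows already processed; starts at 0, as in the diff.
  private long current = 0;
  // Index of the most recently processed row; stays -1 until a row is read.
  private long currentRowIndex = -1L;

  /** Records that the row at the given file-level index has just been read (hypothetical helper). */
  void advance(long rowIndexInFile) {
    currentRowIndex = rowIndexInFile;
    current++;
  }

  /** Returns the index of the current row, or fails if no row has been processed yet. */
  long getCurrentRowIndex() {
    if (current == 0L) {
      // The PR throws a dedicated exception type here; IllegalStateException is an assumption.
      throw new IllegalStateException("row index can be fetched only after processing a row");
    }
    return currentRowIndex;
  }
}
```

Checking `current` rather than `currentRowIndex` keeps the "nothing read yet" state tied to the counter the reader already maintains.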
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492978#comment-17492978 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496854

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
Renamed to `rowIdxInFileItr`.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
Renamed to a shorter name.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {
+      throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+    }
+    if (rowIndexWithinFileIterator == null) {
+      throw new RowIndexNotSupportedException("underlying page read store implementation" +
+        " doesn't support row index generation");
+    }
+    return currentRowIndex;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional rowIndexOffsetForCurrentRowGroup = pages.getRowIndexOffset();

Review comment:
Renamed to a shorter name.
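The diff above shows only the first line of `resetRowIndexIterator`, so the following is a guess at the general idea rather than the PR's implementation. It assumes every row of the row group is read (no column index filtering) and that a missing offset leaves the iterator null, which is the condition `getCurrentRowIndex` checks. `RowIndexIterators` and `forRowGroup` are hypothetical names.

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

/** Hypothetical sketch of rebuilding a row index iterator for each row group. */
final class RowIndexIterators {
  private RowIndexIterators() {}

  /**
   * Returns an iterator over the file-level row indexes of one row group, or null
   * when the page read store reports no offset (assumed to mean "not supported").
   */
  static PrimitiveIterator.OfLong forRowGroup(Optional<Long> rowIndexOffset, long rowCount) {
    if (!rowIndexOffset.isPresent()) {
      return null;
    }
    long start = rowIndexOffset.get();
    // Assumption: all rows are read, so the indexes are start, start + 1, ..., start + rowCount - 1.
    return LongStream.range(start, start + rowCount).iterator();
  }
}
```

With column index filtering enabled, only some rows of the row group are returned, so a contiguous range like this would no longer be correct; the iterator would have to follow the selected row ranges instead.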
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492977#comment-17492977 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496556

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -248,15 +248,18 @@ public DictionaryPage readDictionaryPage() {

   private final Map readers = new HashMap();
   private final long rowCount;
+  private final long rowIndexOffset;
   private final RowRanges rowRanges;

-  public ColumnChunkPageReadStore(long rowCount) {
+  public ColumnChunkPageReadStore(long rowCount, long rowIndexOffset) {

Review comment:
Makes sense - retaining the older constructor.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -265,6 +268,11 @@ public long getRowCount() {
     return rowCount;
   }

+  @Override
+  public Optional getRowIndexOffset() {
+    return Optional.of(rowIndexOffset);

Review comment:
Done.
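One way to retain the older constructor while still exposing an optional offset is sketched below. The thread does not say how the retained constructor behaves, so the `-1` sentinel and the `Optional.empty()` result for it are assumptions of this sketch; `PageStoreSketch` is a hypothetical stand-in for `ColumnChunkPageReadStore`.

```java
import java.util.Optional;

/** Hypothetical stand-in for a page read store; not the parquet-mr class. */
final class PageStoreSketch {
  private final long rowCount;
  // -1 marks "offset unknown"; this sentinel is an assumption of the sketch.
  private final long rowIndexOffset;

  /** Older single-argument constructor, retained so existing callers keep compiling. */
  PageStoreSketch(long rowCount) {
    this(rowCount, -1L);
  }

  /** Newer constructor that also takes the file-level index of the row group's first row. */
  PageStoreSketch(long rowCount, long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  long getRowCount() {
    return rowCount;
  }

  /** Returns the offset if one was supplied, otherwise an empty Optional. */
  Optional<Long> getRowIndexOffset() {
    return rowIndexOffset < 0 ? Optional.empty() : Optional.of(rowIndexOffset);
  }
}
```

Callers built with the older constructor then see an empty `Optional` rather than a misleading offset of 0.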
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492975#comment-17492975 ]
ASF GitHub Bot commented on PARQUET-2117:
prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496428

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }

+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map<RowGroup, Long> rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map<RowGroup, Long> rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter, final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter, final int combinedFooterLength) throws IOException {
     final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
     final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);

-    FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor() {
+    FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = filter.accept(new MetadataFilterVisitor() {

Review comment:
The `visit(OffsetMetadataFilter filter)` and `visit(RangeMetadataFilter filter)` methods return the "filtered" FileMetaData, so some of the row groups might be missing from it. Computing the offsets at the end would therefore generate an incorrect rowIndexOffset.
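Note on review comment discussion_r807496428: the point about computing offsets before filtering can be illustrated with a small, self-contained sketch. It is not the PR's code. `RowGroupStub` and `RowIndexOffsetSketch` are hypothetical stand-ins for the Thrift `RowGroup` class and for `generateRowGroupOffsets`, and only the running-sum logic is taken from the quoted diff.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Thrift RowGroup class; only the row count matters here.
final class RowGroupStub {
  final String name;
  final long numRows;

  RowGroupStub(String name, long numRows) {
    this.name = name;
    this.numRows = numRows;
  }
}

final class RowIndexOffsetSketch {
  // Running sum of row counts: each row group maps to the file-level index of its first row.
  static Map<RowGroupStub, Long> offsets(List<RowGroupStub> rowGroups) {
    Map<RowGroupStub, Long> result = new LinkedHashMap<>();
    long rowIdxSum = 0;
    for (RowGroupStub rowGroup : rowGroups) {
      result.put(rowGroup, rowIdxSum);
      rowIdxSum += rowGroup.numRows;
    }
    return result;
  }

  public static void main(String[] args) {
    RowGroupStub a = new RowGroupStub("a", 100);
    RowGroupStub b = new RowGroupStub("b", 250);
    RowGroupStub c = new RowGroupStub("c", 50);

    // Offsets computed over the complete row group list (before any filter runs).
    List<RowGroupStub> all = Arrays.asList(a, b, c);
    // Offsets computed after a filter has dropped the first two row groups.
    List<RowGroupStub> filtered = Arrays.asList(c);

    System.out.println(offsets(all).get(c));      // 350
    System.out.println(offsets(filtered).get(c)); // 0
  }
}
```

With the full list, row group `c` starts at row 350; computed from the filtered list it would appear to start at row 0, which is the incorrect rowIndexOffset the comment warns about.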
[GitHub] [parquet-mr] prakharjain09 commented on a change in pull request #945: PARQUET-2117: Expose Row Index via ParquetReader and ParquetRecordReader
prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496556

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -248,15 +248,18 @@ public DictionaryPage readDictionaryPage() {
   private final Map readers = new HashMap();
   private final long rowCount;
+  private final long rowIndexOffset;
   private final RowRanges rowRanges;

-  public ColumnChunkPageReadStore(long rowCount) {
+  public ColumnChunkPageReadStore(long rowCount, long rowIndexOffset) {

Review comment:
Makes sense; I am retaining the older constructor.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -265,6 +268,11 @@ public long getRowCount() {
     return rowCount;
   }

+  @Override
+  public Optional<Long> getRowIndexOffset() {
+    return Optional.of(rowIndexOffset);

Review comment:
Done.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492974#comment-17492974 ]
ASF GitHub Bot commented on PARQUET-2117:
prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807495498

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }

+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;

Review comment:
Done.
Updated invitation: Parquet Sync @ Tue Mar 1, 2022 9am - 10am (PST) (dev@parquet.apache.org)
Event: Parquet Sync (calendar request; updates the recurrence instance of Feb 23, 2022, 9am)
When: Tue Mar 1, 2022 9am - 10am, America/Los_Angeles
Organizer: sha...@uber.com
Room: SEA | 1191 2nd Ave-8th-Blakely (7) [Zoom]
Attendees: sha...@uber.com (accepted), Matthew Turner, wesmck...@gmail.com, gabor.szadovs...@cloudera.com, gg5...@gmail.com, gwali...@gmail.com, emkornfi...@gmail.com, "Lekshmi Narayanan, Arun Balajiee", altekruseja...@gmail.com, Ryan Blue, chao.apa...@gmail.com, "Wang, Yuming", fo...@driesprongen.nl, aniskodedoss...@etsy.com, Ivan Gavryliuk, Brian Bowman, jiashenzz...@gmail.com, vinoo.gan...@gmail.com, hadrien.k...@sonat.no, py...@pinterest.com, jorgecarlei...@gmail.com, dev@parquet.apache.org, Robert Kruszewski, Revin Chalil
Description: Xinli shang is inviting you to a scheduled Zoom meeting.
Join Zoom Meeting (password is required): https://uber.zoom.us/j/3523778975?pwd=anhscnBNbFpDaUprQkZ1T3RLRzBPQT09
Meeting ID: 352 377 8975
Password: 030115
One tap mobile: +16699006833,,3523778975# US (San
[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding
[ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492822#comment-17492822 ]
ASF GitHub Bot commented on PARQUET-2120:
rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533

Done!

> parquet-cli dictionary command fails on pages without dictionary encoding
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.12.2
> Reporter: Willi Raschkowski
> Priority: Minor
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
>   at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
>
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _        1      46.00 B   0      "a" / "a"
>
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
>
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _ R      200    0.34 B    0      "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no dictionary encoding. But for files that mix pages with and without dictionary encoding (like above), the command will fail before getting to pages that actually have dictionaries.
> The problem is that [this line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76] assumes {{readDictionaryPage}} always returns a page and doesn't handle when it does not, i.e. when it returns {{null}}.
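Note on PARQUET-2120: the failure mode described in the issue (a reader method that may return {{null}}, followed by an unconditional call on the result) can be sketched as below. This is a minimal illustration under stated assumptions, not the change made in pull request #946. `DictPageStub` and `DictionaryGuardSketch` are hypothetical names, and the messages printed are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a dictionary page; only the encoding name is modelled.
final class DictPageStub {
  final String encoding;

  DictPageStub(String encoding) {
    this.encoding = encoding;
  }
}

final class DictionaryGuardSketch {
  // One list entry per row group; a null entry models a reader that found no dictionary page.
  static List<String> describe(List<DictPageStub> pagesPerRowGroup) {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < pagesPerRowGroup.size(); i++) {
      DictPageStub page = pagesPerRowGroup.get(i);
      if (page == null) {
        // Report and move on instead of dereferencing null, so later row groups are still shown.
        lines.add("Row group " + i + ": no dictionary page");
        continue;
      }
      lines.add("Row group " + i + ": dictionary encoding " + page.encoding);
    }
    return lines;
  }

  public static void main(String[] args) {
    // Mirrors the file in the issue: row group 0 has no dictionary, row group 1 has one.
    List<DictPageStub> pages = Arrays.asList(null, new DictPageStub("PLAIN"));
    for (String line : describe(pages)) {
      System.out.println(line);
    }
  }
}
```

The design choice in the sketch is to skip row groups without a dictionary rather than abort, which addresses the complaint that the command fails before it reaches the row groups that do have dictionaries.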
[GitHub] [parquet-mr] rshkv commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages
rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533

Done!
[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules
[ https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492819#comment-17492819 ]
ASF GitHub Bot commented on PARQUET-2121:
sekikn commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806426714

## File path: README.md ##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
Thank you for the comment, @shangxinli! I updated the PR. Instead of removing the lines, I just added '(deprecated)' to parquet-cascading* and parquet-scrooge in README.md, just like parquet-hive.

> Remove descriptions for the removed modules
>
> Key: PARQUET-2121
> URL: https://issues.apache.org/jira/browse/PARQUET-2121
> Project: Parquet
> Issue Type: Improvement
> Reporter: Kengo Seki
> Assignee: Kengo Seki
> Priority: Minor
>
> PARQUET-2020 removed some deprecated modules, but the related descriptions still remain in some documents. They should be removed since their existence is misleading.
[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules
[ https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492809#comment-17492809 ]
ASF GitHub Bot commented on PARQUET-2121:
shangxinli commented on pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#issuecomment-1039353753

@sekikn Thanks for working on it! I just left some minor comments.
[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules
[ https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492807#comment-17492807 ]
ASF GitHub Bot commented on PARQUET-2121:
shangxinli commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806064746

## File path: README.md ##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
Since the code is still there, do you think we can just add '(deprecated)' here?
[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding
[ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492797#comment-17492797 ]
ASF GitHub Bot commented on PARQUET-2120:
shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1039360003

Thanks for working on it! Can you squash the commits?
[jira] [Resolved] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding
[ https://issues.apache.org/jira/browse/PARQUET-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou resolved PARQUET-2124.
Fix Version/s: cpp-8.0.0
Resolution: Fixed
Issue resolved by pull request 12427 [https://github.com/apache/arrow/pull/12427]

> Bad DCHECK For Intermixed Dictionary Encoding
>
> Key: PARQUET-2124
> URL: https://issues.apache.org/jira/browse/PARQUET-2124
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: William Butler
> Assignee: William Butler
> Priority: Minor
> Labels: pull-request-available
> Fix For: cpp-8.0.0
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Parquet CPP has a DCHECK for a dictionary encoded page coming after a non-dictionary encoded page. This is bad because the DCHECK can be triggered by Parquet files that have a column that has a dictionary page, then a non-dictionary encoded page, then a page of dictionary encoded values (indices). Fuzzing found such a file. While this could be turned into an exception, I don't see anything in the Parquet specification that prohibits such a sequence of pages.
> This situation has been brought up on the mailing list before ([https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos]) and it seems like this is valid but nobody is doing it.
> In the PR that added this check ([https://github.com/apache/parquet-cpp/pull/73]) it was noted that the check is probably not needed.
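Note on PARQUET-2124: the page sequence described in the issue (dictionary page, then a non-dictionary encoded data page, then a dictionary encoded data page) can be decoded by choosing the decoding path per page while keeping the dictionary around. The sketch below illustrates only that idea. It is not the Arrow C++ code from pull request 12427; Java is used for consistency with the other sketches in this thread, and `PageStub` and `MixedPageDecoderSketch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical data page: either plain values or indices into the column chunk's dictionary.
final class PageStub {
  final boolean dictionaryEncoded;
  final String[] plainValues; // set only for plain pages
  final int[] indices;        // set only for dictionary encoded pages

  private PageStub(boolean dictionaryEncoded, String[] plainValues, int[] indices) {
    this.dictionaryEncoded = dictionaryEncoded;
    this.plainValues = plainValues;
    this.indices = indices;
  }

  static PageStub plain(String... values) {
    return new PageStub(false, values, null);
  }

  static PageStub dict(int... indices) {
    return new PageStub(true, null, indices);
  }
}

final class MixedPageDecoderSketch {
  // The decoding path is chosen per page; the order of page encodings is not checked.
  static List<String> decode(String[] dictionary, List<PageStub> pages) {
    List<String> out = new ArrayList<>();
    for (PageStub page : pages) {
      if (page.dictionaryEncoded) {
        if (dictionary == null) {
          // The only hard requirement: indices need a dictionary to resolve against.
          throw new IllegalStateException("dictionary encoded page without a dictionary page");
        }
        for (int index : page.indices) {
          out.add(dictionary[index]);
        }
      } else {
        out.addAll(Arrays.asList(page.plainValues));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String[] dictionary = {"b", "c"};
    List<PageStub> pages = Arrays.asList(
        PageStub.dict(0, 1),  // dictionary encoded
        PageStub.plain("a"),  // non-dictionary encoded
        PageStub.dict(1, 0)); // dictionary encoded again
    System.out.println(decode(dictionary, pages)); // [b, c, a, c, b]
  }
}
```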
[jira] [Resolved] (PARQUET-2123) Invalid memory access in ScanFileContents
[ https://issues.apache.org/jira/browse/PARQUET-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou resolved PARQUET-2123.
Fix Version/s: cpp-8.0.0
Resolution: Fixed
Issue resolved by pull request 12423 [https://github.com/apache/arrow/pull/12423]

> Invalid memory access in ScanFileContents
>
> Key: PARQUET-2123
> URL: https://issues.apache.org/jira/browse/PARQUET-2123
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: William Butler
> Assignee: William Butler
> Priority: Minor
> Labels: pull-request-available
> Fix For: cpp-8.0.0
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When a Parquet file has 0 columns, ScanFileContents will try to access the 0th element of a size 0 vector.
[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492593#comment-17492593 ] Ze'ev Maor commented on PARQUET-2122: - [~junjie] thanks, that worked. It does seem odd, though, that a MAX size of 1MB on the Bloom filter would actually result in 1MB being used by a Bloom filter on a column with a cardinality of just 14, doesn't it? > Adding Bloom filter to small Parquet file bloats in size X1700 > -- > > Key: PARQUET-2122 > URL: https://issues.apache.org/jira/browse/PARQUET-2122 > Project: Parquet > Issue Type: Bug > Components: parquet-cli, parquet-mr >Affects Versions: 1.13.0 >Reporter: Ze'ev Maor >Priority: Critical > Attachments: data.csv, data_index_bloom.parquet > > > Converting a small, 14 rows/1 string column csv file to Parquet without bloom > filter yields a 600B file, adding '.withBloomFilterEnabled(true)' to > ParquetWriter then yields a 1049197B file. > It isn't clear what the extra space is used by. > Attached csv and bloated Parquet files.
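The two file sizes quoted in the report are consistent with the comment's observation that roughly the full 1MB maximum is being spent on the filter. The arithmetic can be sketched as follows, assuming the "600B" figure is exactly 600 bytes and that "1MB" means 2^20 bytes; both are assumptions, since the report gives only rounded numbers, and the class and method names are made up for illustration.

```java
public class BloomOverhead {
    // Extra bytes added to the file after enabling the Bloom filter.
    static long overheadBytes(long sizeWithout, long sizeWith) {
        return sizeWith - sizeWithout;
    }

    // Whole-number growth factor of the file size (integer division).
    static long growthFactor(long sizeWithout, long sizeWith) {
        return sizeWith / sizeWithout;
    }

    public static void main(String[] args) {
        long sizeWithout = 600L;      // "600B" from the report, assumed exact
        long sizeWith = 1_049_197L;   // size reported with the Bloom filter enabled
        long oneMiB = 1L << 20;       // 1048576 bytes, assumed meaning of "1MB"

        long overhead = overheadBytes(sizeWithout, sizeWith);
        System.out.println(overhead);                            // prints 1048597
        System.out.println(overhead - oneMiB);                   // prints 21
        System.out.println(growthFactor(sizeWithout, sizeWith)); // prints 1748
    }
}
```

Under those assumptions the file grew by 1 MiB plus about 21 bytes, which would account for nearly all of the extra space and matches the roughly 1700-fold growth in the issue title.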
[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding
[ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492564#comment-17492564 ]

ASF GitHub Bot commented on PARQUET-2120:
-

rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533

Done!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.12.2
> Reporter: Willi Raschkowski
> Priority: Minor
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
>   at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet
> ...
> Row group 0: count: 1 46.00 B records start: 4 total: 46 B
>
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _        1      46.00 B   0      "a" / "a"
> Row group 1: count: 200 0.34 B records start: 50 total: 69 B
>
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _ R      200    0.34 B    0      "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no > dictionary encoding. But for files that mix pages with and without dictionary > encoding (like above), the command will fail before getting to pages that > actually have dictionaries. > The problem is that [this > line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76] > assumes {{readDictionaryPage}} always returns a page and doesn't handle the > case where it does not, i.e. when it returns {{null}}.
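The failure mode described above, a caller that dereferences the result of {{readDictionaryPage}} without checking for {{null}}, can be illustrated with a small null-guard sketch. The `DictionaryPage` and `DictionaryReader` types below are simplified, hypothetical stand-ins written for this example, not the parquet-mr classes, and the code is not claimed to be the change made in pull request #946.

```java
public class DictionaryGuard {
    // Hypothetical stand-in for a dictionary page; only the encoding name is modeled.
    static final class DictionaryPage {
        private final String encoding;

        DictionaryPage(String encoding) {
            this.encoding = encoding;
        }

        String getEncoding() {
            return encoding;
        }
    }

    // Hypothetical reader for one row group: returns null when the column
    // has no dictionary page there, mirroring the behavior in the report.
    interface DictionaryReader {
        DictionaryPage readDictionaryPage(String column);
    }

    // Describes the dictionary of one row group, reporting a missing
    // dictionary instead of throwing a NullPointerException.
    static String describe(DictionaryReader reader, String column) {
        DictionaryPage page = reader.readDictionaryPage(column);
        if (page == null) {
            return "No dictionary for column " + column;
        }
        return "Dictionary for column " + column + ", encoding: " + page.getEncoding();
    }

    public static void main(String[] args) {
        // First row group: no dictionary page. Second row group: has one.
        DictionaryReader rowGroup0 = column -> null;
        DictionaryReader rowGroup1 = column -> new DictionaryPage("PLAIN");

        System.out.println(describe(rowGroup0, "col")); // prints No dictionary for column col
        System.out.println(describe(rowGroup1, "col")); // prints Dictionary for column col, encoding: PLAIN
    }
}
```

With a guard like this, a file that mixes row groups with and without dictionaries can be processed to the end, so the row groups that do have dictionaries are still shown.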
[GitHub] [parquet-mr] rshkv commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages
rshkv commented on pull request #946: URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533 Done!