[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697187#comment-17697187 ] ASF GitHub Bot commented on PARQUET-2252: - rdblue commented on PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1457264814 Thanks, @zhongyujiang! > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697186#comment-17697186 ] ASF GitHub Bot commented on PARQUET-2252: - rdblue merged PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038 > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696677#comment-17696677 ] ASF GitHub Bot commented on PARQUET-2252: - zhongyujiang commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1125864270 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException { } RowRanges rowRanges = getRowRanges(blockIndex); +return readFilteredRowGroup(blockIndex, rowRanges); + } + + /** + * Reads all the columns requested from the specified row group. It may skip specific pages based on the + * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns row + * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details. + * + * @param blockIndex the index of the requested block + * @param rowRanges the row ranges to be read from the requested block + * @return the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block + * @throws IOException if an error occurs while reading + * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null + */ + public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException { +if (blockIndex < 0 || blockIndex >= blocks.size()) { + throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range are: " + +"[%s, %s]", blockIndex, 0, blocks.size() - 1)); +} + +if (Objects.isNull(rowRanges)) { + throw new IllegalArgumentException("RowRanges must not be null"); +} + +BlockMetaData block = blocks.get(blockIndex); +if (block.getRowCount() == 0L) { + throw new ParquetEmptyBlockException("Illegal row group of 0 rows"); Review Comment: Currently, `readRowGroup(int blockIndex)` and `readFilteredRowGroup(int blockIndex)` also throw an exception when handling empty row groups, should we make them also return null instead of throwing an exception to be consistent? > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696626#comment-17696626 ] ASF GitHub Bot commented on PARQUET-2252: - rdblue commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1125738673 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException { } RowRanges rowRanges = getRowRanges(blockIndex); +return readFilteredRowGroup(blockIndex, rowRanges); + } + + /** + * Reads all the columns requested from the specified row group. It may skip specific pages based on the + * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns row + * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details. + * + * @param blockIndex the index of the requested block + * @param rowRanges the row ranges to be read from the requested block + * @return the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block + * @throws IOException if an error occurs while reading + * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null + */ + public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException { +if (blockIndex < 0 || blockIndex >= blocks.size()) { + throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range are: " + +"[%s, %s]", blockIndex, 0, blocks.size() - 1)); +} + +if (Objects.isNull(rowRanges)) { + throw new IllegalArgumentException("RowRanges must not be null"); +} + +BlockMetaData block = blocks.get(blockIndex); +if (block.getRowCount() == 0L) { + throw new ParquetEmptyBlockException("Illegal row group of 0 rows"); Review Comment: I don't see why this would throw an exception. This method is intended to allow building an external iterator. I don't think anyone would ever want to fail if there were an empty row group, even if the reader thinks it shouldn't have been written. I think this should return null. > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695732#comment-17695732 ] ASF GitHub Bot commented on PARQUET-2252: - wgtmac commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1123155108 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException { } RowRanges rowRanges = getRowRanges(blockIndex); +return readFilteredRowGroup(blockIndex, rowRanges); + } + + /** + * Reads all the columns requested from the specified row group. It may skip specific pages based on the + * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns row + * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details. + * + * @param blockIndex the index of the requested block + * @param rowRanges the row ranges to be read from the requested block + * @return the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block + * @throws IOException if an error occurs while reading + * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null + */ + public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException { +if (blockIndex < 0 || blockIndex >= blocks.size()) { + throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range are: " + +"[%s, %s]", blockIndex, 0, blocks.size() - 1)); +} + +if (Objects.isNull(rowRanges)) { + throw new IllegalArgumentException("RowRanges must not be null"); +} + +BlockMetaData block = blocks.get(blockIndex); +if (block.getRowCount() == 0L) { + throw new ParquetEmptyBlockException("Illegal row group of 0 rows"); Review Comment: That makes sense. > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695692#comment-17695692 ] ASF GitHub Bot commented on PARQUET-2252: - zhongyujiang commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1123028615 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException { } RowRanges rowRanges = getRowRanges(blockIndex); +return readFilteredRowGroup(blockIndex, rowRanges); + } + + /** + * Reads all the columns requested from the specified row group. It may skip specific pages based on the + * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns row + * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details. + * + * @param blockIndex the index of the requested block + * @param rowRanges the row ranges to be read from the requested block + * @return the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block + * @throws IOException if an error occurs while reading + * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null + */ + public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException { +if (blockIndex < 0 || blockIndex >= blocks.size()) { + throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range are: " + +"[%s, %s]", blockIndex, 0, blocks.size() - 1)); +} + +if (Objects.isNull(rowRanges)) { + throw new IllegalArgumentException("RowRanges must not be null"); +} + +BlockMetaData block = blocks.get(blockIndex); +if (block.getRowCount() == 0L) { + throw new ParquetEmptyBlockException("Illegal row group of 0 rows"); Review Comment: I checked PARQUET-2291, seems we only skip empty row group when using reader as a iterator, right? We skip empty row group in `readNextRowGroup()` , but not when the user passes in a `blockIndex`, and this newly introduced method also requires the user pass in a `blockIndex`. > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695407#comment-17695407 ] ASF GitHub Bot commented on PARQUET-2252: - wgtmac commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1122511179 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException { } RowRanges rowRanges = getRowRanges(blockIndex); +return readFilteredRowGroup(blockIndex, rowRanges); + } + + /** + * Reads all the columns requested from the specified row group. It may skip specific pages based on the + * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns row + * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details. + * + * @param blockIndex the index of the requested block + * @param rowRanges the row ranges to be read from the requested block + * @return the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block + * @throws IOException if an error occurs while reading + * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null + */ + public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException { +if (blockIndex < 0 || blockIndex >= blocks.size()) { + throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range are: " + +"[%s, %s]", blockIndex, 0, blocks.size() - 1)); +} + +if (Objects.isNull(rowRanges)) { + throw new IllegalArgumentException("RowRanges must not be null"); +} + +BlockMetaData block = blocks.get(blockIndex); +if (block.getRowCount() == 0L) { + throw new ParquetEmptyBlockException("Illegal row group of 0 rows"); Review Comment: Now the reader simply skips empty row groups instead of throw. Could you change this to be consistent? > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695180#comment-17695180 ] ASF GitHub Bot commented on PARQUET-2252: - gszadovszky commented on PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450409997 Since these are already used in iceberg I think it is better to have them public and maintain backward compatibility. > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695166#comment-17695166 ] ASF GitHub Bot commented on PARQUET-2252: - wgtmac commented on PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450359195 @gszadovszky @shangxinli Do you have any concern? > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695023#comment-17695023 ] ASF GitHub Bot commented on PARQUET-2252: - zhongyujiang commented on PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1449991193 @wgtmac @rdblue Can you please help review this? > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695022#comment-17695022 ] ASF GitHub Bot commented on PARQUET-2252: - zhongyujiang opened a new pull request, #1038: URL: https://github.com/apache/parquet-mr/pull/1038 …implement page skipping. Issue: [PARQUET-2252](https://issues.apache.org/jira/browse/PARQUET-2252) This PR makes some methods required to implement column index filter public to allow Iceberg build its own column index filtering. Since Iceberg is going to calculate `RowRanges` itself, this also adds a public method in `ParquetFileReader` that allows users to pass in `RowRanges` to read filtered row group. Use of these changes can refer to this [PR](https://github.com/apache/iceberg/pull/6967), currently it uses reflection as a workaround. > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)