[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697187#comment-17697187
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

rdblue commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1457264814

   Thanks, @zhongyujiang!




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697186#comment-17697186
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

rdblue merged PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696677#comment-17696677
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

zhongyujiang commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1125864270


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int 
blockIndex) throws IOException {
 }
 
 RowRanges rowRanges = getRowRanges(blockIndex);
+return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages 
of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or 
the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, 
RowRanges rowRanges) throws IOException {
+if (blockIndex < 0 || blockIndex >= blocks.size()) {
+  throw new IllegalArgumentException(String.format("Invalid block index 
%s, the valid block index range are: " +
+"[%s, %s]", blockIndex, 0, blocks.size() - 1));
+}
+
+if (Objects.isNull(rowRanges)) {
+  throw new IllegalArgumentException("RowRanges must not be null");
+}
+
+BlockMetaData block = blocks.get(blockIndex);
+if (block.getRowCount() == 0L) {
+  throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   Currently, `readRowGroup(int blockIndex)` and `readFilteredRowGroup(int 
blockIndex)` also throw an exception when handling empty row groups, should we 
make them also return null instead of throwing an exception to be consistent? 





> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696626#comment-17696626
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

rdblue commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1125738673


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int 
blockIndex) throws IOException {
 }
 
 RowRanges rowRanges = getRowRanges(blockIndex);
+return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages 
of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or 
the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, 
RowRanges rowRanges) throws IOException {
+if (blockIndex < 0 || blockIndex >= blocks.size()) {
+  throw new IllegalArgumentException(String.format("Invalid block index 
%s, the valid block index range are: " +
+"[%s, %s]", blockIndex, 0, blocks.size() - 1));
+}
+
+if (Objects.isNull(rowRanges)) {
+  throw new IllegalArgumentException("RowRanges must not be null");
+}
+
+BlockMetaData block = blocks.get(blockIndex);
+if (block.getRowCount() == 0L) {
+  throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   I don't see why this would throw an exception. This method is intended to 
allow building an external iterator. I don't think anyone would ever want to 
fail if there were an empty row group, even if the reader thinks it shouldn't 
have been written. I think this should return null.





> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695732#comment-17695732
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

wgtmac commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1123155108


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int 
blockIndex) throws IOException {
 }
 
 RowRanges rowRanges = getRowRanges(blockIndex);
+return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages 
of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or 
the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, 
RowRanges rowRanges) throws IOException {
+if (blockIndex < 0 || blockIndex >= blocks.size()) {
+  throw new IllegalArgumentException(String.format("Invalid block index 
%s, the valid block index range are: " +
+"[%s, %s]", blockIndex, 0, blocks.size() - 1));
+}
+
+if (Objects.isNull(rowRanges)) {
+  throw new IllegalArgumentException("RowRanges must not be null");
+}
+
+BlockMetaData block = blocks.get(blockIndex);
+if (block.getRowCount() == 0L) {
+  throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   That makes sense.





> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695692#comment-17695692
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

zhongyujiang commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1123028615


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int 
blockIndex) throws IOException {
 }
 
 RowRanges rowRanges = getRowRanges(blockIndex);
+return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages 
of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or 
the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, 
RowRanges rowRanges) throws IOException {
+if (blockIndex < 0 || blockIndex >= blocks.size()) {
+  throw new IllegalArgumentException(String.format("Invalid block index 
%s, the valid block index range are: " +
+"[%s, %s]", blockIndex, 0, blocks.size() - 1));
+}
+
+if (Objects.isNull(rowRanges)) {
+  throw new IllegalArgumentException("RowRanges must not be null");
+}
+
+BlockMetaData block = blocks.get(blockIndex);
+if (block.getRowCount() == 0L) {
+  throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   I checked PARQUET-2291, seems we only skip empty row group when using reader 
as a iterator, right? We skip empty row group in `readNextRowGroup()` , but not 
when the user passes in a `blockIndex`, and  this newly introduced method also 
requires the user pass in a `blockIndex`.
   





> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695407#comment-17695407
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

wgtmac commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1122511179


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int 
blockIndex) throws IOException {
 }
 
 RowRanges rowRanges = getRowRanges(blockIndex);
+return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages 
of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or 
the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, 
RowRanges rowRanges) throws IOException {
+if (blockIndex < 0 || blockIndex >= blocks.size()) {
+  throw new IllegalArgumentException(String.format("Invalid block index 
%s, the valid block index range are: " +
+"[%s, %s]", blockIndex, 0, blocks.size() - 1));
+}
+
+if (Objects.isNull(rowRanges)) {
+  throw new IllegalArgumentException("RowRanges must not be null");
+}
+
+BlockMetaData block = blocks.get(blockIndex);
+if (block.getRowCount() == 0L) {
+  throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   Now the reader simply skips empty row groups instead of throw. Could you 
change this to be consistent?





> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695180#comment-17695180
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

gszadovszky commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450409997

   Since these are already used in iceberg I think it is better to have them 
public and maintain backward compatibility. 




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695166#comment-17695166
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

wgtmac commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450359195

   @gszadovszky @shangxinli Do you have any concern?




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695023#comment-17695023
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

zhongyujiang commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1449991193

   @wgtmac @rdblue Can you please help review this?




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695022#comment-17695022
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

zhongyujiang opened a new pull request, #1038:
URL: https://github.com/apache/parquet-mr/pull/1038

   …implement page skipping.
   
   Issue: [PARQUET-2252](https://issues.apache.org/jira/browse/PARQUET-2252)
   
   This PR makes some methods required to implement column index filter public 
to allow Iceberg build its own column index filtering.  Since Iceberg is going 
to calculate `RowRanges` itself, this also adds a public method in 
`ParquetFileReader` that allows users to pass in `RowRanges` to read filtered 
row group. Use of these changes can refer to this 
[PR](https://github.com/apache/iceberg/pull/6967), currently it uses reflection 
as a workaround.




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions, we would like to be able to use some of the methods in Parquet 
> repo, for example: methods in `RowRanges` and `IndexIterator`, however these 
> are currently not public. Currently we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)