rdblue commented on a change in pull request #1254:
URL: https://github.com/apache/iceberg/pull/1254#discussion_r462659192
##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java
##########
@@ -157,6 +164,23 @@ ParquetFileReader reader() {
return shouldSkip;
}
+ private Map<Long, Long> generateRowGroupsStartRowPos() {
Review comment:
You're right. Good catch!
Can you add some comments to explain why this is needed for later?
##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java
##########
@@ -157,6 +164,23 @@ ParquetFileReader reader() {
return shouldSkip;
}
+ private Map<Long, Long> generateRowGroupsStartRowPos() {
+ ParquetFileReader fileReader = newReader(this.file,
ParquetReadOptions.builder().build());
+ Map<Long, Long> offsetToStartRowPosMap = new HashMap<>();
+ long curRowCount = 0;
+ for (int i = 0; i < fileReader.getRowGroups().size(); i += 1) {
+ BlockMetaData meta = fileReader.getRowGroups().get(i);
+ offsetToStartRowPosMap.put(meta.getStartingPos(), curRowCount);
+ curRowCount += meta.getRowCount();
+ }
+
+ return offsetToStartRowPosMap;
+ }
+
+ long[] startRowPositions() {
+ return rowGroupsStartRowPos;
Review comment:
Can we use the same name for the variable?
##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java
##########
@@ -85,6 +87,9 @@
this.rowGroups = reader.getRowGroups();
this.shouldSkip = new boolean[rowGroups.size()];
+ Map<Long, Long> offsetToStartRowPosMap = generateRowGroupsStartRowPos();
Review comment:
How about naming this `offsetToStartPos` and similarly updating the
method name? There's no need to include a type in the variable name, usually.
##########
File path:
parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReader.java
##########
@@ -29,5 +29,5 @@
List<TripleIterator<?>> columns();
- void setPageSource(PageReadStore pageStore);
+ void setPageSource(PageReadStore pageStore, long rowPosition);
Review comment:
How about adding a default implementation that calls the old method?
That way older implementations are still supported.
```java
default void setPageSource(PageReadStore pageStore, long rowPosition) {
setPageSource(pageStore);
}
default setPageSource(PageReadStore pageStore) {
throw new UnsupportedOperationException("At least one setPageSource
method must be implemented");
}
```
##########
File path:
parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReader.java
##########
@@ -29,5 +29,5 @@
List<TripleIterator<?>> columns();
- void setPageSource(PageReadStore pageStore);
+ void setPageSource(PageReadStore pageStore, long rowPosition);
Review comment:
How about adding a default implementation that calls the old method?
That way older implementations are still supported.
```java
default void setPageSource(PageReadStore pageStore, long rowPosition) {
setPageSource(pageStore);
}
default void setPageSource(PageReadStore pageStore) {
throw new UnsupportedOperationException("At least one setPageSource
method must be implemented");
}
```
##########
File path:
parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReader.java
##########
@@ -29,5 +29,5 @@
List<TripleIterator<?>> columns();
- void setPageSource(PageReadStore pageStore);
+ void setPageSource(PageReadStore pageStore, long rowPosition);
Review comment:
Also, then we can remove the changes from many of the implementations
and cut down on the changes in this PR.
##########
File path:
parquet/src/main/java/org/apache/iceberg/parquet/BaseColumnIterator.java
##########
@@ -72,4 +73,7 @@ public boolean hasNext() {
return triplesRead < triplesCount;
}
+ public void setRowPosition(long rowPosition) {
Review comment:
Instead of adding this, can you update `setPageSource` like the other
interface that changed?
##########
File path:
parquet/src/main/java/org/apache/iceberg/parquet/BaseColumnIterator.java
##########
@@ -34,6 +34,7 @@
protected long triplesRead = 0L;
protected long advanceNextPageCount = 0L;
protected Dictionary dictionary;
+ protected long rowPosition;
Review comment:
Is this needed? I don't see any uses.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]