singhpk234 commented on code in PR #7279:
URL: https://github.com/apache/iceberg/pull/7279#discussion_r1164529427
##########
parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java:
##########
@@ -154,21 +177,47 @@ public T next() {
}
private void advance() {
- while (shouldSkip[nextRowGroup]) {
- nextRowGroup += 1;
- reader.skipNextRowGroup();
- }
- PageReadStore pages;
try {
- pages = reader.readNextRowGroup();
- } catch (IOException e) {
- throw new RuntimeIOException(e);
+ Preconditions.checkNotNull(prefetchRowGroupFuture, "future should not be null");
+ PageReadStore pages = prefetchRowGroupFuture.get();
+
+ if (prefetchedRowGroup >= totalRowGroups) {
+ return;
+ }
+ Preconditions.checkState(
+ pages != null,
+ "advance() should only be called when there is at least one row group to read");
+ long rowPosition = rowGroupsStartRowPos[prefetchedRowGroup];
+ model.setRowGroupInfo(pages, columnChunkMetadata.get(prefetchedRowGroup), rowPosition);
+ nextRowGroupStart += pages.getRowCount();
+ prefetchedRowGroup += 1;
+ prefetchNextRowGroup(); // eagerly fetch the next row group
Review Comment:
Was testing this with concurrent reads; it looks like `reader.readRowGroup(int blockIndex)` is not thread safe. It requires the `SeekableInputStream` that the Parquet reader is holding to be seeked to the given offset first, and since that stream is an instance variable, concurrent access was causing correctness issues.
One possible solution is to keep a pool of `ParquetFileReader`s: take a reader from the pool, ask it to read the desired row group, and close the pool once all row groups have been read. Thinking of prototyping this change.
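For what it's worth, the pool idea could be sketched roughly like this. `ReaderPool` is a hypothetical name and the factory would wrap `ParquetFileReader` construction (omitted here); the point is the borrow/release/close lifecycle that gives each thread exclusive use of a reader:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Generic fixed-size pool: each concurrent row-group read borrows its own
// reader, so no two threads ever share one reader's underlying stream.
class ReaderPool<R> implements AutoCloseable {
  private final BlockingQueue<R> idle;
  private final List<R> all = new ArrayList<>();

  ReaderPool(int size, Supplier<R> readerFactory) {
    this.idle = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) {
      R reader = readerFactory.get();
      all.add(reader);
      idle.add(reader);
    }
  }

  // Blocks until a reader is free; the caller gets exclusive use of it.
  R borrow() {
    try {
      return idle.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  // Return the reader so another row-group read can reuse it.
  void release(R reader) {
    idle.add(reader);
  }

  // Once all row groups are read, close the pool (a real implementation
  // would close each pooled ParquetFileReader here).
  @Override
  public void close() {
    idle.clear();
    all.clear();
  }
}
```

A worker thread would `borrow()` a reader, read its assigned row group, then `release()` it, so the seek-then-read sequence on the `SeekableInputStream` never interleaves across threads.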
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]