huberylee commented on code in PR #39393:
URL: https://github.com/apache/arrow/pull/39393#discussion_r1440385297
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1370,6 +1402,54 @@ class TypedRecordReader : public
TypedColumnReaderImpl<DType>,
return bytes_for_values;
}
+ // Two parts different from original HasNextInternal:
Review Comment:
> Yeah, I mean it limits the usage of `TypedColumnReader`, and only allow
> internal skip. External skip would introduce inconsistent skipping. In
> current-case, `Skip(skip_last)` would skip more than `skip_last`.

First, it is odd to call ``SkipRecords`` relative to the hit rows at all.
Second, whether the behavior is consistent depends on how the semantics of
``SkipRecords`` are defined. If the skip count is interpreted relative to the
hit rows, the current implementation can in theory guarantee consistency,
although more tests need to be added to verify this. If ``SkipRecords`` is
meant to count all rows in the page, then the existing implementation does
indeed have a problem.
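
To make the two readings concrete, here is a minimal sketch. It is not the code in this PR and not the Parquet C++ API: the `hit` mask and both function names are hypothetical, and a page is modeled as a plain vector of flags marking which rows passed the filter.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model: hit[i] is true when row i of the page is a hit row.
// Neither function below exists in Arrow; they only illustrate the semantics.

// Reading A: `n` counts hit rows only. Non-hit rows in between are passed
// over as well, so the position can advance by more than `n` rows.
std::size_t SkipAmongHits(const std::vector<bool>& hit, std::size_t pos,
                          std::size_t n) {
  while (pos < hit.size() && n > 0) {
    if (hit[pos]) --n;  // only hit rows consume the skip budget
    ++pos;
  }
  return pos;
}

// Reading B: `n` counts every row in the page, hit or not.
std::size_t SkipAmongAllRows(const std::vector<bool>& hit, std::size_t pos,
                             std::size_t n) {
  return std::min(hit.size(), pos + n);
}
```

With `hit = {1, 0, 0, 1, 1, 0, 1}` and a skip of 2 from position 0, reading A lands on row 4 while reading B lands on row 2. That gap is what shows up as `Skip(skip_last)` skipping more than `skip_last` rows when the caller assumes reading B.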