jp0317 commented on code in PR #39393:
URL: https://github.com/apache/arrow/pull/39393#discussion_r1440069238


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1370,6 +1402,54 @@ class TypedRecordReader : public 
TypedColumnReaderImpl<DType>,
     return bytes_for_values;
   }
 
+  // Two parts different from original HasNextInternal:
+  //   1. When reading a new page, the corresponding skip info will be updated;
+  //   2. When current page has more data to read, it will check skip info
+  //   inside current page;
+  bool HasNextInternal() override {
+    // Either there is no data page available yet, or the data page has been
+    // exhausted
+    if (this->num_buffered_values_ == 0 ||
+        this->num_decoded_values_ == this->num_buffered_values_) {
+      if (!this->ReadNewPage() || this->num_buffered_values_ == 0) {
+        return false;
+      }
+
+      // Skip header rows and hide last rows for this page
+      auto read_states = this->pager_->GetPageReadStates();
+      if (read_states != nullptr) {
+        PageSkipInfo& skip_info = read_states->NextPageSkipInfo();
+        const int64_t records_to_skip = skip_info.SkipRowNum();
+        const int64_t skipped_records = SkipRecords(records_to_skip);

Review Comment:
   the `ReadNextPage` may use data page filter. Assuming users set both filter 
and row ranges. Would `SkipRecords(records_to_skip)` skip wrong records?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to