mapleFU commented on code in PR #39393:
URL: https://github.com/apache/arrow/pull/39393#discussion_r1440340013


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1370,6 +1402,54 @@ class TypedRecordReader : public 
TypedColumnReaderImpl<DType>,
     return bytes_for_values;
   }
 
+  // Two parts different from original HasNextInternal:
+  //   1. When reading a new page, the corresponding skip info will be updated;
+  //   2. When current page has more data to read, it will check skip info
+  //   inside current page;
+  bool HasNextInternal() override {
+    // Either there is no data page available yet, or the data page has been
+    // exhausted
+    if (this->num_buffered_values_ == 0 ||
+        this->num_decoded_values_ == this->num_buffered_values_) {
+      if (!this->ReadNewPage() || this->num_buffered_values_ == 0) {
+        return false;
+      }
+
+      // Skip header rows and hide last rows for this page
+      auto read_states = this->pager_->GetPageReadStates();
+      if (read_states != nullptr) {
+        PageSkipInfo& skip_info = read_states->NextPageSkipInfo();
+        const int64_t records_to_skip = skip_info.SkipRowNum();
+        const int64_t skipped_records = SkipRecords(records_to_skip);

Review Comment:
   @jp0317 Should we combine `ShouldSkipPage` with PageIndex? Our current 
codebase relies something weird here. 🤔



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to