Re: [PR] [C++] Expose prefetch range planning via Reader::preBufferRange and refactor preBuffer to reuse it [orc]

via GitHub Wed, 06 May 2026 20:25:25 -0700


Copilot commented on code in PR #2616:
URL: https://github.com/apache/orc/pull/2616#discussion_r3198774795



##########
c++/src/Reader.cc:
##########
@@ -1792,12 +1792,33 @@ namespace orc {
     std::vector<bool> selectedColumns;
     columnSelector.updateSelected(selectedColumns, rowReaderOptions);
 
+    std::vector<std::pair<uint64_t, uint64_t>> ranges;
+
     for (auto stripe : newStripes) {
       const auto& stripeInfo = footer_->stripes(stripe);
       proto::StripeFooter stripeFooter = getStripeFooter(stripeInfo, 
*contents_);
-      auto ranges = extractReadRangesForStripe(stripe, stripeInfo, 
stripeFooter, selectedColumns);
-      contents_->cacheRanges(std::move(ranges));
+      auto stripeRanges =
+          extractReadRangesForStripe(stripe, stripeInfo, stripeFooter, 
selectedColumns);
+      for (const auto& range : stripeRanges) {
+        ranges.emplace_back(range.offset, range.length);
+      }
+    }
+    return ranges;
+  }
+
+  void ReaderImpl::preBuffer(const std::vector<uint32_t>& stripes,
+                             const std::list<uint64_t>& includeTypes) {
+    auto ranges = preBufferRange(stripes, includeTypes);
+    if (ranges.empty()) {
+      return;
+    }
+
+    std::vector<ReadRange> readRanges;
+    readRanges.reserve(ranges.size());
+    for (const auto& range : ranges) {
+      readRanges.emplace_back(range.first, range.second);
     }
+    contents_->cacheRanges(std::move(readRanges));

Review Comment:
   The refactor introduces an extra conversion step (`pair<uint64_t,uint64_t>` 
→ `ReadRange`) for the existing `preBuffer()` path. To keep `preBufferRange()` 
as the public API while avoiding duplicate representations internally, consider 
factoring out a shared internal helper that produces `std::vector<ReadRange>` 
(the native type you already consume), and have `preBufferRange()` convert to 
pairs only at the API boundary. This reduces representation churn and keeps 
caching semantics closer to the previous implementation.



##########
c++/src/Reader.cc:
##########
@@ -1792,12 +1792,33 @@ namespace orc {
     std::vector<bool> selectedColumns;
     columnSelector.updateSelected(selectedColumns, rowReaderOptions);
 
+    std::vector<std::pair<uint64_t, uint64_t>> ranges;
+
     for (auto stripe : newStripes) {
       const auto& stripeInfo = footer_->stripes(stripe);
       proto::StripeFooter stripeFooter = getStripeFooter(stripeInfo, 
*contents_);
-      auto ranges = extractReadRangesForStripe(stripe, stripeInfo, 
stripeFooter, selectedColumns);
-      contents_->cacheRanges(std::move(ranges));
+      auto stripeRanges =
+          extractReadRangesForStripe(stripe, stripeInfo, stripeFooter, 
selectedColumns);
+      for (const auto& range : stripeRanges) {
+        ranges.emplace_back(range.offset, range.length);
+      }
+    }
+    return ranges;

Review Comment:
   `preBufferRange()` returns ranges in an order that depends on the 
caller-provided `stripes` ordering. Since this API is intended to let external 
users “coordinate prefetching and avoid duplicated work”, the 
ordering/normalization should be part of the contract (e.g., sorted by offset, 
and ideally coalesced/merged where possible). Consider sorting by offset (and 
documenting that ordering) before returning; this also makes downstream 
caching/prefetching behavior more predictable and may prevent consumers or 
`cacheRanges()` from needing to handle arbitrary ordering.
   



##########
c++/include/orc/Reader.hh:
##########
@@ -696,6 +697,16 @@ namespace orc {
     virtual void preBuffer(const std::vector<uint32_t>& stripes,
                            const std::list<uint64_t>& includeTypes) = 0;
 
+    /**
+     * Calculate prefetch ranges by selected stripes and columns.
+     * It is thread safe and does not cache data.
+     * @param stripes the stripes to prefetch
+     * @param includeTypes the types to prefetch
+     * @return prefetch ranges as offset/length pairs
+     */
+    virtual std::vector<std::pair<uint64_t, uint64_t>> preBufferRange(
+        const std::vector<uint32_t>& stripes, const std::list<uint64_t>& 
includeTypes) = 0;

Review Comment:
   The doc comment states “It is thread safe”, but this method performs IO and 
reads shared `FileContents` state (e.g., via `getStripeFooter(...)`). Unless 
thread-safety is guaranteed end-to-end by the underlying stream/contents 
implementation, this claim is misleading for API consumers. Either (a) remove 
the “thread safe” guarantee, or (b) document the precise thread-safety contract 
(what is safe to call concurrently and under which shared resources), and 
ensure the implementation enforces it (e.g., by using thread-safe positional 
reads or synchronization around shared stream access).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [C++] Expose prefetch range planning via Reader::preBufferRange and refactor preBuffer to reuse it [orc]

Reply via email to