Copilot commented on code in PR #2616:
URL: https://github.com/apache/orc/pull/2616#discussion_r3198774795
##########
c++/src/Reader.cc:
##########
@@ -1792,12 +1792,33 @@ namespace orc {
std::vector<bool> selectedColumns;
columnSelector.updateSelected(selectedColumns, rowReaderOptions);
+ std::vector<std::pair<uint64_t, uint64_t>> ranges;
+
for (auto stripe : newStripes) {
const auto& stripeInfo = footer_->stripes(stripe);
proto::StripeFooter stripeFooter = getStripeFooter(stripeInfo,
*contents_);
- auto ranges = extractReadRangesForStripe(stripe, stripeInfo,
stripeFooter, selectedColumns);
- contents_->cacheRanges(std::move(ranges));
+ auto stripeRanges =
+ extractReadRangesForStripe(stripe, stripeInfo, stripeFooter,
selectedColumns);
+ for (const auto& range : stripeRanges) {
+ ranges.emplace_back(range.offset, range.length);
+ }
+ }
+ return ranges;
+ }
+
+ void ReaderImpl::preBuffer(const std::vector<uint32_t>& stripes,
+ const std::list<uint64_t>& includeTypes) {
+ auto ranges = preBufferRange(stripes, includeTypes);
+ if (ranges.empty()) {
+ return;
+ }
+
+ std::vector<ReadRange> readRanges;
+ readRanges.reserve(ranges.size());
+ for (const auto& range : ranges) {
+ readRanges.emplace_back(range.first, range.second);
}
+ contents_->cacheRanges(std::move(readRanges));
Review Comment:
The refactor introduces an extra conversion step (`pair<uint64_t,uint64_t>`
→ `ReadRange`) for the existing `preBuffer()` path. To keep `preBufferRange()`
as the public API while avoiding duplicate representations internally, consider
factoring out a shared internal helper that produces `std::vector<ReadRange>`
(the native type you already consume), and have `preBufferRange()` convert to
pairs only at the API boundary. This reduces representation churn and keeps
caching semantics closer to the previous implementation.
##########
c++/src/Reader.cc:
##########
@@ -1792,12 +1792,33 @@ namespace orc {
std::vector<bool> selectedColumns;
columnSelector.updateSelected(selectedColumns, rowReaderOptions);
+ std::vector<std::pair<uint64_t, uint64_t>> ranges;
+
for (auto stripe : newStripes) {
const auto& stripeInfo = footer_->stripes(stripe);
proto::StripeFooter stripeFooter = getStripeFooter(stripeInfo,
*contents_);
- auto ranges = extractReadRangesForStripe(stripe, stripeInfo,
stripeFooter, selectedColumns);
- contents_->cacheRanges(std::move(ranges));
+ auto stripeRanges =
+ extractReadRangesForStripe(stripe, stripeInfo, stripeFooter,
selectedColumns);
+ for (const auto& range : stripeRanges) {
+ ranges.emplace_back(range.offset, range.length);
+ }
+ }
+ return ranges;
Review Comment:
`preBufferRange()` returns ranges in an order that depends on the
caller-provided `stripes` ordering. Since this API is intended to let external
users “coordinate prefetching and avoid duplicated work”, the
ordering/normalization should be part of the contract (e.g., sorted by offset,
and ideally coalesced/merged where possible). Consider sorting by offset (and
documenting that ordering) before returning; this also makes downstream
caching/prefetching behavior more predictable and may prevent consumers or
`cacheRanges()` from needing to handle arbitrary ordering.
##########
c++/include/orc/Reader.hh:
##########
@@ -696,6 +697,16 @@ namespace orc {
virtual void preBuffer(const std::vector<uint32_t>& stripes,
const std::list<uint64_t>& includeTypes) = 0;
+ /**
+ * Calculate prefetch ranges by selected stripes and columns.
+ * It is thread safe and does not cache data.
+ * @param stripes the stripes to prefetch
+ * @param includeTypes the types to prefetch
+ * @return prefetch ranges as offset/length pairs
+ */
+ virtual std::vector<std::pair<uint64_t, uint64_t>> preBufferRange(
+ const std::vector<uint32_t>& stripes, const std::list<uint64_t>&
includeTypes) = 0;
Review Comment:
The doc comment states “It is thread safe”, but this method performs IO and
reads shared `FileContents` state (e.g., via `getStripeFooter(...)`). Unless
thread-safety is guaranteed end-to-end by the underlying stream/contents
implementation, this claim is misleading for API consumers. Either (a) remove
the “thread safe” guarantee, or (b) document the precise thread-safety contract
(what is safe to call concurrently and under which shared resources), and
ensure the implementation enforces it (e.g., by using thread-safe positional
reads or synchronization around shared stream access).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]