cbb330 opened a new pull request, #49379:
URL: https://github.com/apache/arrow/pull/49379
### Rationale
Part 1 of ORC predicate pushdown (#48986).
Add public APIs to `ORCFileReader` for accessing stripe-level and file-level
column statistics as Arrow scalars, for stripe-selective reading, and for
Arrow-to-ORC schema mapping. This is the foundation the dataset layer will
consume for predicate evaluation. The design draws heavily on the Parquet
predicate pushdown implementation.
### What changes are included in this PR?
New structs in `adapter.h`:
```cpp
/// Column statistics from an ORC file, with min/max as Arrow scalars.
struct OrcColumnStatistics {
  bool has_null;
  int64_t num_values;
  bool has_min_max;
  std::shared_ptr<Scalar> min;  // Arrow scalar (nullptr if not available)
  std::shared_ptr<Scalar> max;  // Arrow scalar (nullptr if not available)
};
```
The struct exists because liborc's statistics API returns type-erased
`ColumnStatistics*` pointers with different C++ return types per subclass
(`int64_t`, `double`, `std::string`, `Decimal`), and the dataset layer needs a
single uniform type it can pass around without knowing the ORC column type —
`std::shared_ptr<Scalar>` provides that.
New structs in `util.h`:
```cpp
/// Maps an Arrow field to its ORC physical column ID.
/// ORC uses depth-first pre-order numbering (column 0 = root struct).
struct OrcSchemaField {
  std::shared_ptr<Field> field;
  int orc_column_id;
  std::vector<OrcSchemaField> children;  // for struct, list, map
  bool is_leaf() const;
};

/// Maps an entire Arrow schema to ORC column IDs.
struct OrcSchemaManifest {
  std::vector<OrcSchemaField> schema_fields;  // parallel to Arrow schema fields
  const OrcSchemaField* GetField(const std::vector<int>& path) const;
};
```
The dataset layer needs to translate Arrow field references from filter
expressions into ORC column indices for statistics lookup.
`BuildSchemaManifest()` walks the paired ORC and Arrow type trees to build this
mapping, handling nested types (struct, list, map) recursively.
New methods on `ORCFileReader`:
```cpp
class ORCFileReader {
 public:
  // --- Existing API (unchanged) ---
  Result<std::shared_ptr<Table>> Read();
  Result<std::shared_ptr<RecordBatch>> ReadStripe(int64_t stripe);
  Result<std::shared_ptr<RecordBatch>> ReadStripe(
      int64_t stripe, const std::vector<int>& include_indices);
  int64_t NumberOfStripes();
  StripeInformation GetStripeInformation(int64_t stripe);

  // --- NEW: stripe-selective reading ---
  // Read only selected stripes, concatenated into a single Table.
  // Used by the dataset layer after predicate pushdown eliminates stripes.
  Result<std::shared_ptr<Table>> ReadStripes(
      const std::vector<int64_t>& stripe_indices);
  Result<std::shared_ptr<Table>> ReadStripes(
      const std::vector<int64_t>& stripe_indices,
      const std::vector<int>& include_indices);

  // --- NEW: column statistics ---
  // File-level statistics for one column.
  Result<OrcColumnStatistics> GetColumnStatistics(int column_index);
  // Stripe-level statistics for one column.
  Result<OrcColumnStatistics> GetStripeColumnStatistics(int64_t stripe_index,
                                                        int column_index);
  // Bulk variant — parses stripe statistics once instead of per-column.
  Result<std::vector<OrcColumnStatistics>> GetStripeStatistics(
      int64_t stripe_index, const std::vector<int>& column_indices);

  // --- NEW: schema manifest ---
  // Builds a mapping from Arrow field paths to ORC column IDs by walking
  // the paired ORC/Arrow type trees. Needed for statistics lookup.
  Result<std::shared_ptr<OrcSchemaManifest>> BuildSchemaManifest(
      const std::shared_ptr<Schema>& arrow_schema) const;
};
```
For reference, the Parquet counterpart of `ORCFileReader` is
`parquet::arrow::FileReader` at `cpp/src/parquet/arrow/reader.h:116`:

| ORC (`ORCFileReader`) | Parquet (`parquet::arrow::FileReader`) |
| -- | -- |
| `ReadStripe(i)` | `ReadRowGroup(i)` |
| `ReadStripes(indices)` | `ReadRowGroups(row_groups)` |
| `ReadStripes(indices, cols)` | `ReadRowGroups(row_groups, column_indices)` |
| `NumberOfStripes()` | `num_row_groups()` |
| `GetRecordBatchReader(batch_size, names)` | `GetRecordBatchReader(row_groups, columns)` |
| `Read()` | `ReadTable()` |
| `BuildSchemaManifest(schema)` | (Parquet uses `SchemaManifest` in `parquet/arrow/schema.h`) |
New internal helper:
`ConvertColumnStatistics()` downcasts the type-erased liborc
`ColumnStatistics` to its typed subclass and produces the corresponding
Arrow scalar.
```cpp
Result<OrcColumnStatistics> ConvertColumnStatistics(
    const liborc::ColumnStatistics* s) {
  OrcColumnStatistics out{s->hasNull(),
                          static_cast<int64_t>(s->getNumberOfValues()),
                          /*has_min_max=*/false, nullptr, nullptr};
  if (auto* p = dynamic_cast<const liborc::IntegerColumnStatistics*>(s)) {
    // BYTE, SHORT, INT, LONG → Int64Scalar
  } else if (auto* p = dynamic_cast<const liborc::DoubleColumnStatistics*>(s)) {
    // FLOAT, DOUBLE → DoubleScalar (skip if NaN)
  } else if (auto* p = dynamic_cast<const liborc::StringColumnStatistics*>(s)) {
    // STRING, VARCHAR, CHAR → StringScalar
  } else if (auto* p = dynamic_cast<const liborc::DateColumnStatistics*>(s)) {
    // DATE → Date32Scalar
  } else if (auto* p =
                 dynamic_cast<const liborc::TimestampColumnStatistics*>(s)) {
    // millis * 1_000_000 + sub-millis nanos → TimestampScalar(NANO)
  } else if (auto* p =
                 dynamic_cast<const liborc::DecimalColumnStatistics*>(s)) {
    // ORC Int128 → Arrow Decimal128Scalar(precision=38, scale from ORC)
    // Skip if min.scale != max.scale (corrupted stats)
  }
  // Boolean, Binary, Collection, etc. → no min/max (has_min_max stays false)
  return out;
}
```
The pattern is the same for every branch: `dynamic_cast` to typed subclass,
check `hasMinimum() && hasMaximum()`, wrap in the corresponding Arrow scalar.
The interesting branches are double (NaN guard), timestamp (two-part
millis+nanos conversion), and decimal (scale consistency check + Int128 bit
extraction).
### Are these changes tested?
Unit tests in `adapter_test.cc`:
*Statistics tests:*
- Integer column statistics (file-level and stripe-level)
- String column statistics
- Boolean column statistics (verifies no min/max via fallthrough)
- Date column statistics
- Timestamp column statistics
- Double with NaN values (NaN guard)
- Out-of-range column/stripe index → error
- Negative column index → error
- Columns with nulls
- Bulk stripe statistics (multiple columns at once, verified against
individual calls)
*Stripe-selective reading tests:*
- ReadStripes with multiple stripes
- ReadStripes with column selection
- ReadStripes with out-of-range stripe index → error
- ReadStripes with empty stripe indices → error
*Schema manifest tests:*
- BuildSchemaManifest maps Arrow field names to correct ORC column IDs
(column 0 = root struct, column 1 = first field, etc.)
- Verifies field count matches, leaf detection works
### Are there any user-facing changes?
No. These are new C++ APIs on `ORCFileReader` that are not yet exposed in
Python bindings.
### Component(s)
C++