ShreyeshArangath opened a new issue, #49288: URL: https://github.com/apache/arrow/issues/49288
### Describe the enhancement requested

### Summary

The ORC dataset integration currently lacks stripe-level subsetting support. When scanning ORC files through the Dataset API, there is no way to select specific stripes; the entire file is always read. This is a gap compared to `ParquetFileFragment`, which provides row-group-level subsetting via `Subset()`, `row_groups()`, and `MakeFragment(..., row_groups)`.

### Details

Modeled after the `ParquetFileFragment` design, this proposal introduces stripe-aware ORC fragments so callers can target specific stripes during planning and scanning (instead of always reading the full file). This adds a small, consistent API surface in C++ (and in Python, as a separate issue):

* An ORC-specific fragment type that can represent either the full file or a subset of the file defined by stripe IDs.
* Fragment subsetting via a `subset(...)`/`Subset(...)` API, analogous to Parquet row-group subsetting.
* Scan behavior that honors stripe selection, so execution reads only the requested stripes.
* Correct row counting for subset fragments, where row counts reflect only the selected stripes.

### Component(s)

C++

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
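
To illustrate the subsetting and row-counting behavior requested in the Details above, here is a minimal, self-contained sketch. It is not Arrow code: `StripeInfo`, `OrcFragmentSketch`, and their members are hypothetical names invented for this example, and the real fragment type would obtain per-stripe row counts from ORC file metadata. It assumes C++17 and that stripe IDs are zero-based indices into the file's stripe list.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical per-stripe metadata (illustration only, not an Arrow or ORC type).
struct StripeInfo {
  int64_t num_rows;
};

// Hypothetical fragment: represents either the full file (no selection)
// or a subset of the file defined by stripe IDs.
struct OrcFragmentSketch {
  std::vector<StripeInfo> stripes;           // every stripe in the file
  std::optional<std::vector<int>> selected;  // std::nullopt means "whole file"

  // Return a new fragment restricted to the given stripe IDs.
  // Throws if any ID does not refer to a stripe in the file.
  OrcFragmentSketch Subset(std::vector<int> stripe_ids) const {
    for (int id : stripe_ids) {
      if (id < 0 || static_cast<std::size_t>(id) >= stripes.size()) {
        throw std::out_of_range("stripe ID out of range");
      }
    }
    return OrcFragmentSketch{stripes, std::move(stripe_ids)};
  }

  // Row count reflects only the selected stripes, or all stripes
  // when no selection is present.
  int64_t CountRows() const {
    int64_t total = 0;
    if (!selected.has_value()) {
      for (const StripeInfo& s : stripes) total += s.num_rows;
      return total;
    }
    for (int id : *selected) total += stripes[static_cast<std::size_t>(id)].num_rows;
    return total;
  }
};
```

With a file of three stripes holding 100, 250, and 50 rows, the full fragment in this sketch counts 400 rows, while `Subset({0, 2})` counts 150. A scanner that honors the selection would likewise read only stripes 0 and 2.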
