ShreyeshArangath opened a new issue, #49288:
URL: https://github.com/apache/arrow/issues/49288

   ### Describe the enhancement requested
   
   ### Summary 
   The ORC dataset integration currently lacks stripe-level subsetting. 
When scanning ORC files through the Dataset API, there is no way to select 
specific stripes, so the entire file is always read. This is a gap compared to 
`ParquetFileFragment`, which provides row-group-level subsetting via `Subset()`, 
`row_groups()`, and `MakeFragment(..., row_groups)`.
   
   ### Details
   
   Modeled after the `ParquetFileFragment` design, this proposal introduces 
stripe-aware ORC fragments so callers can target specific stripes during 
planning and scanning (instead of always reading the full file). This adds a 
small, consistent API surface in C++ (Python is covered in a separate issue):
   * An ORC-specific fragment type that can represent either the full file or 
a subset of the file defined by stripe IDs.
   * Fragment subsetting via a `subset(...)`/`Subset(...)` API, analogous to 
Parquet row-group subsetting.
   * Scan behavior that honors stripe selection, so execution reads only the 
requested stripes.
   * Correct row counting for subset fragments, where row counts reflect only 
the selected stripes.
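
To make the intended semantics concrete, here is a minimal, self-contained sketch of how stripe subsetting and row counting could behave. It is a plain-Python model only, not the Arrow implementation: `StripeFragmentSketch`, its fields, and the choice to reject stripe IDs outside the current selection are all assumptions made for illustration, and the real API shape is still to be decided.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class StripeFragmentSketch:
    """Hypothetical model of a stripe-aware ORC fragment (not an Arrow class)."""

    path: str
    # Rows per stripe, as the file metadata would report them (assumed input).
    stripe_row_counts: Tuple[int, ...]
    # Selected stripe IDs; None means the fragment covers the full file.
    stripes: Optional[Tuple[int, ...]] = None

    def selected_stripes(self) -> Tuple[int, ...]:
        """Return the stripe IDs this fragment would read."""
        if self.stripes is None:
            return tuple(range(len(self.stripe_row_counts)))
        return self.stripes

    def subset(self, stripe_ids: Sequence[int]) -> "StripeFragmentSketch":
        """Return a new fragment restricted to the given stripe IDs."""
        current = set(self.selected_stripes())
        missing = [i for i in stripe_ids if i not in current]
        if missing:
            raise IndexError(f"stripes not in this fragment: {missing}")
        # Deduplicate and sort so stripes are read in file order.
        return StripeFragmentSketch(
            self.path, self.stripe_row_counts, tuple(sorted(set(stripe_ids)))
        )

    def count_rows(self) -> int:
        """Count rows in the selected stripes only."""
        return sum(self.stripe_row_counts[i] for i in self.selected_stripes())


# Usage: a three-stripe file, then a subset covering stripes 0 and 2.
full = StripeFragmentSketch("data.orc", (100, 250, 75))
part = full.subset([2, 0])
print(full.count_rows(), part.selected_stripes(), part.count_rows())
```

In this model a subset fragment never touches the metadata of unselected stripes when counting rows, which mirrors the last two bullets: the scan and the row count both follow the selection rather than the whole file.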
   
   
   
   ### Component(s)
   
   C++

