cbb330 commented on PR #49289: URL: https://github.com/apache/arrow/pull/49289#issuecomment-3932077146
### Code review

Found 1 issue:

1. File opened twice without metadata caching: `MakeFragment(source, ..., stripe_ids)` opens the ORC file via `OpenORCReader(source)` to validate stripe IDs (line 334), then discards the reader. Later, `OrcScanTask::Impl::Make` opens it again (line 76) for actual scanning. For remote or cloud-backed files this doubles I/O cost. The Parquet implementation avoids this by caching file metadata inside the fragment (`EnsureCompleteMetadata` / `SetMetadata`) so that subsequent scans reuse it. Consider caching the reader or its metadata in the fragment, or deferring validation to scan time.

https://github.com/apache/arrow/blob/6f9c86973ca60fce7f8092ef7aaf72157a1715e5/cpp/src/arrow/dataset/file_orc.cc#L330-L348

🤖 Generated with [Claude Code](https://claude.ai/code)

<sub>- If this code review was useful, please react with 👍. Otherwise, react with 👎.</sub>

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
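
As a rough illustration of the caching suggestion in the review, here is a minimal, self-contained sketch that does not use Arrow at all. `FileMetadata`, `CountingSource`, `CachingFragment`, `EnsureMetadata`, and `OpensForValidateThenScan` are all hypothetical names invented for this example; they are not the Arrow or ORC APIs, and the real fix in `file_orc.cc` may look quite different. The sketch only shows the general pattern: load metadata once behind a mutex, then let both stripe-ID validation and scanning reuse it.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the footer metadata of an ORC file.
struct FileMetadata {
  int num_stripes = 0;
};

// Hypothetical stand-in for a file source; counts how often it is opened.
struct CountingSource {
  std::string path;
  int num_stripes;
  int open_count;

  // Simulates the expensive "open file and read footer" step.
  std::shared_ptr<const FileMetadata> OpenAndReadMetadata() {
    ++open_count;
    auto meta = std::make_shared<FileMetadata>();
    meta->num_stripes = num_stripes;
    return meta;
  }
};

// Hypothetical fragment that loads metadata once and reuses it afterwards.
class CachingFragment {
 public:
  CachingFragment(CountingSource* source, std::vector<int> stripe_ids)
      : source_(source), stripe_ids_(std::move(stripe_ids)) {}

  // Loads metadata on first use; later calls return the cached copy.
  std::shared_ptr<const FileMetadata> EnsureMetadata() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!metadata_) metadata_ = source_->OpenAndReadMetadata();
    return metadata_;
  }

  // Returns true if every requested stripe id exists in the file.
  bool ValidateStripeIds() {
    auto meta = EnsureMetadata();
    for (int id : stripe_ids_) {
      if (id < 0 || id >= meta->num_stripes) return false;
    }
    return true;
  }

  // Stand-in for scanning: returns the number of stripes that would be read.
  int Scan() {
    auto meta = EnsureMetadata();
    assert(meta != nullptr && "metadata must be loaded before scanning");
    return static_cast<int>(stripe_ids_.size());
  }

 private:
  CountingSource* source_;
  std::vector<int> stripe_ids_;
  std::mutex mutex_;
  std::shared_ptr<const FileMetadata> metadata_;
};

// Validates, then scans, and reports how many times the source was opened.
// Returns -1 if validation fails.
int OpensForValidateThenScan() {
  CountingSource source{"example.orc", 4, 0};
  CachingFragment fragment(&source, {0, 2});
  if (!fragment.ValidateStripeIds()) return -1;
  fragment.Scan();
  return source.open_count;
}
```

In this toy setup, validation and scanning together trigger a single simulated open, which is the behavior the review asks for. Deferring validation to scan time, the other option mentioned, would reach the same open count by a different route.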
