cbb330 commented on PR #49289:
URL: https://github.com/apache/arrow/pull/49289#issuecomment-3932077146

   ### Code review
   
   Found 1 issue:
   
   1. File opened twice without metadata caching: `MakeFragment(source, ..., 
stripe_ids)` opens the ORC file via `OpenORCReader(source)` to validate the 
stripe IDs (line 334), then discards the reader. Later, `OrcScanTask::Impl::Make` 
opens the same file again (line 76) for the actual scan. For remote or 
cloud-backed files, the second open repeats I/O that the first one already 
paid for. The Parquet implementation avoids this by caching file metadata 
inside the fragment (`EnsureCompleteMetadata` / `SetMetadata`) so that 
subsequent scans reuse it. Consider caching the reader or its metadata in the 
fragment, or deferring stripe ID validation to scan time.
   
   
https://github.com/apache/arrow/blob/6f9c86973ca60fce7f8092ef7aaf72157a1715e5/cpp/src/arrow/dataset/file_orc.cc#L330-L348
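
To illustrate the caching suggestion, here is a minimal, self-contained sketch of a fragment that loads metadata once and reuses it for both validation and scanning. It does not use the Arrow or ORC APIs: `CountingSource`, `StripeMetadata`, `OpenAndReadMetadata`, and `CachingFragment` are hypothetical stand-ins invented for this example, and the real fix would need to fit the existing `OrcFileFragment` design.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a file source; counts how often it is opened.
struct CountingSource {
  int64_t num_stripes = 0;
  int open_count = 0;
};

// Hypothetical stand-in for the footer metadata a reader would expose.
struct StripeMetadata {
  int64_t num_stripes = 0;
};

// Simulates the expensive step: every call counts as one file open.
std::shared_ptr<const StripeMetadata> OpenAndReadMetadata(CountingSource& source) {
  ++source.open_count;
  auto metadata = std::make_shared<StripeMetadata>();
  metadata->num_stripes = source.num_stripes;
  return metadata;
}

// Hypothetical fragment that caches metadata after the first open.
class CachingFragment {
 public:
  CachingFragment(CountingSource& source, std::vector<int> stripe_ids)
      : source_(source), stripe_ids_(std::move(stripe_ids)) {}

  // Opens the source only if no metadata has been cached yet.
  std::shared_ptr<const StripeMetadata> EnsureMetadata() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!metadata_) {
      metadata_ = OpenAndReadMetadata(source_);
    }
    return metadata_;
  }

  // Validates stripe IDs against the (possibly cached) metadata.
  void ValidateStripeIds() {
    auto metadata = EnsureMetadata();
    for (int id : stripe_ids_) {
      if (id < 0 || id >= metadata->num_stripes) {
        throw std::out_of_range("stripe id out of range");
      }
    }
  }

  // Pretends to scan; returns the number of stripes that would be read.
  // Reuses the metadata cached by validation instead of reopening.
  int64_t Scan() {
    auto metadata = EnsureMetadata();
    assert(metadata != nullptr);
    return static_cast<int64_t>(stripe_ids_.size());
  }

 private:
  CountingSource& source_;
  std::vector<int> stripe_ids_;
  std::mutex mutex_;
  std::shared_ptr<const StripeMetadata> metadata_;
};
```

In this toy model, calling `ValidateStripeIds()` and then `Scan()` on the same fragment increments `open_count` only once. Whether to cache the reader itself or only its metadata is a design choice this sketch leaves open.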
   
   🤖 Generated with [Claude Code](https://claude.ai/code)
   

