saadtajwar opened a new pull request, #22962: URL: https://github.com/apache/datafusion/pull/22962
## Which issue does this PR close?

- Closes #21419

## Rationale for this change

The CSV scanner currently uses `calculate_range`, which issues two extra `get_opts` requests per byte range to find newline boundaries (one for the start boundary, one for the end boundary), plus one GET for the actual data. For a file split into 3 partitions, this results in 8 total object store requests.

#20823 solved the same problem for the JSON scanner by introducing `AlignedBoundaryStream`, which wraps the raw byte stream and lazily aligns to newline boundaries as data is read, eliminating the extra boundary-seeking requests entirely. This PR applies the same approach to CSV.

## What changes are included in this PR?

Based on the approach from #20823:

- Moved `AlignedBoundaryStream` from `datasource-json` to the shared `datasource` crate so it can be reused by both the JSON and CSV scanners.
- Updated `CsvOpener` to use `AlignedBoundaryStream` instead of `calculate_range`, and removed `calculate_range` and `find_first_newline`, as they no longer had any callers.
- Updated tests to reflect these changes.

Note: `RangeCalculation` is left in place as it is a public API item, even though it no longer has any consumers.

## Are these changes tested?

Yes. The existing `AlignedBoundaryStream` unit tests (16 tests covering boundary alignment edge cases) were moved along with the implementation and continue to pass. The `query_csv_file_with_byte_range_partitions` snapshot test in `object_store_access.rs` has been updated to verify the new request pattern (4 requests instead of 8).

## Are there any user-facing changes?

No.
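For readers unfamiliar with the technique, here is a minimal, std-only sketch of lazy newline alignment over a chunked byte stream. It is not the `AlignedBoundaryStream` code from this PR or from #20823. The names `AlignedChunks` and `read_partition` are hypothetical, and the ownership rule is an assumption made for illustration: a partition owns every line whose first byte falls inside its range `[start, end)`. The raw read is assumed to begin one byte before `start`, so a newline sitting exactly at `start - 1` can be detected. The sketch also ignores newlines inside quoted CSV fields.

```rust
/// Hypothetical illustration only; not DataFusion's `AlignedBoundaryStream`.
/// Wraps an iterator of raw byte chunks and yields only the complete lines
/// that start inside the byte range `[start, end)`.
struct AlignedChunks<I> {
    inner: I,
    pos: usize,     // absolute file offset of the next byte from `inner`
    end: usize,     // exclusive end of the assigned byte range
    skipping: bool, // still discarding the tail of the previous partition's line
    done: bool,     // the last owned line has been emitted
}

impl<I: Iterator<Item = Vec<u8>>> AlignedChunks<I> {
    /// Assumption: `inner` starts at file offset `start.saturating_sub(1)`.
    fn new(inner: I, start: usize, end: usize) -> Self {
        AlignedChunks {
            inner,
            pos: start.saturating_sub(1),
            end,
            skipping: start > 0,
            done: start >= end,
        }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Iterator for AlignedChunks<I> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        while !self.done {
            let chunk = self.inner.next()?; // end of file ends the stream
            let base = self.pos;
            self.pos += chunk.len();

            let mut from = 0;
            if self.skipping {
                // Drop bytes up to and including the first newline seen.
                match chunk.iter().position(|&b| b == b'\n') {
                    Some(i) => {
                        self.skipping = false;
                        from = i + 1;
                    }
                    None => continue,
                }
                // The first complete line starts at or past `end`: nothing is owned.
                if base + from >= self.end {
                    self.done = true;
                    return None;
                }
            }

            // Stop after the first newline at absolute offset >= end - 1.
            let search_from = from.max((self.end - 1).saturating_sub(base));
            let mut to = chunk.len();
            if search_from < chunk.len() {
                if let Some(i) = chunk[search_from..].iter().position(|&b| b == b'\n') {
                    to = search_from + i + 1;
                    self.done = true;
                }
            }
            if from < to {
                return Some(chunk[from..to].to_vec());
            }
        }
        None
    }
}

/// Simulates one ranged read that arrives in `chunk_size`-byte pieces.
fn read_partition(data: &[u8], start: usize, end: usize, chunk_size: usize) -> Vec<u8> {
    let raw = data[start.saturating_sub(1)..]
        .chunks(chunk_size)
        .map(|c| c.to_vec());
    AlignedChunks::new(raw, start, end).flatten().collect()
}
```

With the 22-byte input `a,1\nbb,2\nccc,3\ndddd,4\n` and the ranges `[0, 7)`, `[7, 15)`, and `[15, 22)`, `read_partition` returns the first two rows, the third row, and the fourth row respectively. Each row is produced exactly once, and each partition consumes a single stream without any separate boundary-seeking reads.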
