andygrove opened a new pull request, #1615: URL: https://github.com/apache/datafusion-ballista/pull/1615
# Which issue does this PR close?

Closes #.

# Rationale for this change

Sort-shuffle finalize previously decoded every spilled batch and re-emitted it through an IPC `FileWriter`, paying decompress + Arrow allocation + recompress for every spilled byte. This PR replaces that round-trip with a `std::io::copy` of the spill file straight into the consolidated output. On Linux this engages `copy_file_range` / `sendfile`, so spilled bytes never re-enter user space.

# What changes are included in this PR?

The on-disk format for sort-shuffle output changes:

- **Data file**: was a single IPC File with a footer of batch-block offsets. Now it is a leading schema-header IPC stream followed by per-partition byte ranges, each holding zero or more concatenated self-contained IPC streams.
- **Index file**: was little-endian i64 cumulative batch indices (despite the docstring already promising byte offsets). Now it stores actual little-endian i64 byte offsets, matching what the docstring always claimed.
- **Reader**: `stream_sort_shuffle_partition` recovers the schema from the leading header stream and uses a new bounded multi-stream reader that crosses concatenated stream EOS markers within a partition's byte range.

Hash-based shuffle is intentionally untouched. Public API of `is_sort_shuffle_output`, `get_index_path`, and `stream_sort_shuffle_partition` is unchanged, so `ShuffleReaderExec` and the executor's Arrow Flight service work without modification.

New tests cover multi-spill, in-memory-only, and empty-partition round-trips.

# Are there any user-facing changes?

No public-API changes. The sort-shuffle on-disk format changes; it is executor-internal, but in-flight files written by older binaries are not readable by this version.
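To make the layout described above concrete, here is a minimal std-only sketch of the finalize step and the matching range lookup. It is not the code in this PR: the function names `consolidate_spills` and `read_partition_bytes` are hypothetical, the header is treated as opaque bytes standing in for the schema-header IPC stream, and writing one extra trailing end offset to the index is an assumption made so that every partition has an explicit `[start, end)` range.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: append each partition's spill files to `data_path`
/// after an opaque header, and write little-endian i64 byte offsets to
/// `index_path` (one start offset per partition plus a final end offset,
/// which is an assumption of this sketch).
pub fn consolidate_spills(
    header: &[u8],
    partitions: &[Vec<PathBuf>],
    data_path: &Path,
    index_path: &Path,
) -> io::Result<Vec<i64>> {
    let mut out = File::create(data_path)?;
    out.write_all(header)?;
    let mut pos = header.len() as u64;

    let mut offsets = Vec::with_capacity(partitions.len() + 1);
    for spills in partitions {
        // Byte offset at which this partition's range begins.
        offsets.push(pos as i64);
        for spill in spills {
            let mut src = File::open(spill)?;
            // File-to-File copy; on Linux the standard library may use
            // copy_file_range/sendfile instead of a user-space buffer.
            pos += io::copy(&mut src, &mut out)?;
        }
    }
    offsets.push(pos as i64); // end of the last partition

    let mut index = File::create(index_path)?;
    for off in &offsets {
        index.write_all(&off.to_le_bytes())?;
    }
    Ok(offsets)
}

/// Hypothetical helper: return the raw bytes of one partition's range.
/// A real reader would decode the concatenated IPC streams in this range.
pub fn read_partition_bytes(
    data_path: &Path,
    index_path: &Path,
    partition: usize,
) -> io::Result<Vec<u8>> {
    let mut index = File::open(index_path)?;
    index.seek(SeekFrom::Start((partition * 8) as u64))?;
    let mut start_raw = [0u8; 8];
    let mut end_raw = [0u8; 8];
    index.read_exact(&mut start_raw)?;
    index.read_exact(&mut end_raw)?;
    let start = i64::from_le_bytes(start_raw) as u64;
    let end = i64::from_le_bytes(end_raw) as u64;

    let mut data = File::open(data_path)?;
    data.seek(SeekFrom::Start(start))?;
    let mut buf = Vec::with_capacity((end - start) as usize);
    // Bound the read to the partition's byte range.
    data.take(end - start).read_to_end(&mut buf)?;
    Ok(buf)
}
```

An empty partition shows up as two equal adjacent offsets, so the lookup returns zero bytes without any special casing. Because each spill file is appended unmodified, the bytes in a range are exactly the concatenated spill contents, which is why a reader has to continue past each stream's EOS marker until the range is exhausted.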
