EnricoMi opened a new pull request, #54268:
URL: https://github.com/apache/spark/pull/54268

   ### What changes were proposed in this pull request?
   This introduces a generic `FileSystemSegmentManagedBuffer`, which wraps a 
segment of a file on a Hadoop `FileSystem`. `FallbackStorage` then uses it to 
read block data lazily.
   
   ### Why are the changes needed?
   The `ShuffleBlockFetcherIterator` iterates over various sources of block 
data: local, host-local, push-merged local, remote and fallback storage blocks. 
It goes to great lengths to keep memory consumption low during iteration. On 
creation of the iterator, `ShuffleBlockFetcherIterator.initialize()` creates a 
`ManagedBuffer` for each local, host-local and push-merged local block. A 
`ManagedBuffer` reads the data of its block only when 
`ShuffleBlockFetcherIterator.next()` reaches that block.
   
   Remote blocks are fetched synchronously and only up to a specific number of 
bytes.
   
   Currently, the method `FallbackStorage.read` returns a `ManagedBuffer` that 
already holds the data. Therefore, fallback storage blocks are fully read in 
`ShuffleBlockFetcherIterator.initialize()`. All shuffle data of the iterator 
that originates from the fallback storage is held in memory before the 
iteration starts.
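
To illustrate the difference between eager and lazy block data, here is a minimal sketch in plain Java. It uses a local file in place of a Hadoop `FileSystem`, and `LazyFileSegment` and `LazySegmentDemo` are hypothetical names chosen for illustration. This is not the `FileSystemSegmentManagedBuffer` implementation from this PR, only the general idea of storing a file, an offset and a length, and deferring the read until the data is consumed.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LazySegmentDemo {
    // Returns the bytes seen by the segment, decoded as ASCII.
    static String demo() {
        try {
            Path file = Files.createTempFile("segment-demo", ".data");
            try {
                Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
                // Creating the segment does not touch the file content.
                LazyFileSegment segment = new LazyFileSegment(file, 2, 4);
                // Overwrite the file after the segment was created ...
                Files.write(file, "abcdefghij".getBytes(StandardCharsets.US_ASCII));
                // ... the deferred read sees the new content.
                return new String(segment.readBytes(), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "cdef"
    }
}

// Hypothetical stand-in for a lazily read file segment (not the PR's class).
final class LazyFileSegment {
    private final Path file;
    private final long offset;
    private final int length;

    // Construction records only where the data lives; it performs no I/O.
    LazyFileSegment(Path file, long offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    long size() {
        return length;
    }

    // Opens the file and reads exactly `length` bytes starting at `offset`.
    byte[] readBytes() {
        byte[] data = new byte[length];
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(offset);
            in.readFully(data); // throws EOFException if the segment is truncated
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return data;
    }
}
```

An eager buffer would have captured `2345` when it was created. The lazy segment holds no block data until `readBytes()` is called, which is the property that keeps memory use low between `initialize()` and `next()`.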
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Unit tests for `FileSystemSegmentManagedBuffer` and 
`ShuffleBlockFetcherIterator`. The latter is now explicitly tested with 
fallback storage blocks.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]