Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/19184
@viirya @jerryshao To take a step back here.
This specific issue is applicable to window operations and not to shuffle.
In shuffle, you have a much larger volume of data written per file, vs ~4k records
per file for window operations.
To get to 9k files with shuffle, you are typically processing a TB or more of
data per shuffle task (unless the executor is strapped for memory and spills a
large number of files).
On the other hand, with a 4k window spill size (the default in Spark), getting to 9k
files is possible within a single task.
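As a rough back-of-the-envelope check (illustrative arithmetic only, not from the patch; I believe the ~4k default comes from `spark.sql.windowExec.buffer.spill.threshold`, but treat the config name as an assumption here):

```scala
// Rough illustration: how many spill files a single window task can produce,
// assuming ~4k records per spill file (the default mentioned above).
object SpillFileEstimate {
  def spillFiles(rowsInPartition: Long, recordsPerSpillFile: Long = 4096): Long =
    (rowsInPartition + recordsPerSpillFile - 1) / recordsPerSpillFile

  def main(args: Array[String]): Unit = {
    // ~37M rows in one partition already yields ~9k spill files, which is
    // entirely plausible for a single window task on a skewed partition.
    println(spillFiles(37L * 1000 * 1000)) // => 9034
  }
}
```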
From what I see, there is actually no functional/performance reason to keep
all the files open, unlike in shuffle.
Having said that, there is a cost we pay with this approach: with N files, we
incur an additional `N * cost(open + read int + close)`.
Practically though, the impact is much lower when compared to the current code
(since the re-open + read will be disk-read friendly).
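To make the trade-off concrete, here is a minimal sketch (not the actual patch; `SpillFileReader` and its format are hypothetical stand-ins, not the real `UnsafeSorterSpillReader` API) of keeping every spill reader open versus re-opening each file only when it is consumed:

```scala
import java.io.{DataInputStream, File, FileInputStream}

// Hypothetical spill file reader: open, read the record-count header,
// stream records, close.
class SpillFileReader(file: File) extends AutoCloseable {
  private val in = new DataInputStream(new FileInputStream(file))
  val numRecords: Int = in.readInt()           // the "read int" cost above
  def readRecord(): Array[Byte] = {
    val len = in.readInt()
    val buf = new Array[Byte](len)
    in.readFully(buf)
    buf
  }
  override def close(): Unit = in.close()
}

// Current shape (simplified): every spill file holds an open fd at once,
// so N files means N simultaneously open descriptors (closing omitted).
def readAllEager(files: Seq[File]): Iterator[Array[Byte]] = {
  val readers = files.map(new SpillFileReader(_))   // N open fds here
  readers.iterator.flatMap(r => Iterator.fill(r.numRecords)(r.readRecord()))
}

// Proposed shape: open each file lazily, paying N * cost(open + read int +
// close) but never holding more than one fd open at a time.
def readAllLazy(files: Seq[File]): Iterator[Array[Byte]] =
  files.iterator.flatMap { f =>
    val r = new SpillFileReader(f)
    val records = Array.fill(r.numRecords)(r.readRecord())
    r.close()
    records.iterator
  }
```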
While getting it fixed for all cases would be ideal, the solution for
window operations does not transfer to shuffle (and vice versa) due to the
difference in how files are used in the two.
In case I missed something here, please let me know.