Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/19184
Hi @mridulm , sorry for the late response. I agree with you that the scenario
here is different from shuffle, but the underlying structure and spill
mechanism are the same, so the problem is the same. On the shuffle side we can
raise the memory threshold to hold more data before spilling and so avoid too
many spills, but as you mentioned, we cannot do that here.
Yes, it is not strictly necessary to open all the files beforehand. But since
we're using a priority queue to do the merge sort, it is very likely that all
the file handles end up open anyway. So this fix only reduces the chance of
hitting the too-many-open-files issue. Maybe we should call it an interim fix,
what do you think?
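To make the point concrete, here is a minimal sketch (not Spark's actual ExternalAppendOnlyMap/ExternalSorter code) of a priority-queue based k-way merge. The `SpillStream` helper is hypothetical, and for simplicity it assumes each spill file is a text file of sorted lines. The key behavior it illustrates: every queue entry wraps a live reader over one spill file, and since the head record of every stream must be comparable on each poll, all k file handles are opened before the first merged record is emitted.

```scala
// Sketch only: illustrates why a priority-queue merge tends to hold
// every spill file open at once. Not Spark's real implementation.
import java.io.{BufferedReader, File, FileReader}
import scala.collection.mutable

object KWayMergeSketch {
  // Hypothetical helper: wraps an open reader over one spill file;
  // `head` is the stream's current (smallest remaining) record.
  private class SpillStream(file: File) {
    private val reader = new BufferedReader(new FileReader(file)) // handle stays open
    var head: String = reader.readLine()
    def advance(): Unit = {
      head = reader.readLine()
      if (head == null) reader.close() // handle released only at EOF
    }
  }

  // Merges sorted spill files. Note that every file is opened up front,
  // so k spill files means k simultaneously open handles.
  def merge(spills: Seq[File]): Iterator[String] = {
    // Reverse the ordering: mutable.PriorityQueue is a max-heap, and we
    // want the smallest head first.
    val queue = mutable.PriorityQueue.empty[SpillStream](
      Ordering.by[SpillStream, String](_.head).reverse)
    spills.foreach { f =>
      val s = new SpillStream(f) // open a handle for every spill before merging
      if (s.head != null) queue.enqueue(s)
    }
    new Iterator[String] {
      def hasNext: Boolean = queue.nonEmpty
      def next(): String = {
        val s = queue.dequeue()
        val out = s.head
        s.advance()
        if (s.head != null) queue.enqueue(s) // re-insert while the file has data
        out
      }
    }
  }
}
```

Since the merge interleaves records from all inputs, a handle opened lazily on first access is still opened almost immediately, which is why deferring the opens only shrinks the window rather than bounding the number of open files.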