GitHub user felixcheung commented on the issue:
https://github.com/apache/spark/pull/20414
> Actually for the first case, you shall use coalesce() instead of
> repartition() to get a similar effect, without need of another shuffle!
Not quite: coalesce() will not combine partitions across executors (that
would require a shuffle), so you could still end up with many, many files.
I have seen that quite a bit with large-scale ML. But FWIW, my earlier
comment applied to both "regular" use cases and ML use cases.
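
For anyone comparing the two calls discussed above, here is a minimal sketch, not taken from the PR. The local master, the app name, the row count and the partition counts (200 and 10) are all assumptions picked for illustration. It only shows how to inspect the resulting partition counts and physical plans; how many files a real job writes still depends on the data and the cluster.

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for illustration; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[4]")
  .appName("coalesce-vs-repartition")
  .getOrCreate()

// Hypothetical input: one million rows spread over 200 partitions.
val df = spark.range(0L, 1000000L, 1L, 200)

// coalesce(n): narrow dependency, merges existing partitions without a shuffle.
val coalesced = df.coalesce(10)

// repartition(n): full shuffle that redistributes rows across n partitions.
val repartitioned = df.repartition(10)

println(s"coalesce:    ${coalesced.rdd.getNumPartitions} partitions")
println(s"repartition: ${repartitioned.rdd.getNumPartitions} partitions")

// Print both physical plans; the repartition plan is expected to show an
// Exchange (shuffle) step, while the coalesce plan is not.
coalesced.explain()
repartitioned.explain()
```

This can be pasted into spark-shell as-is (getOrCreate() reuses the existing session there). Comparing the two explain() outputs is a quick way to check whether a shuffle is actually being introduced before deciding between the two.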