Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/20414
Talked to @yanboliang offline; he said the major use cases of
RDD/DataFrame.repartition() he has observed in ML workloads are:
1. When saving models, you may use `repartition()` to reduce the number of
output files; a typical special case is `xxx.repartition(1)`;
2. You may use `repartition()` to give the original data set more partitions,
to gain higher parallelism for subsequent operations.
For the first case, you should actually use `coalesce()` instead of
`repartition()` to get a similar effect without the cost of another shuffle
(see the sketch below). Also, that scenario doesn't strictly require the data
set to be distributed evenly, so the change from round-robin partitioning to
hash partitioning should be fine.
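To make the first point concrete, here is a minimal sketch (the session setup,
data, and output paths are assumptions invented for illustration, not from
this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("save-demo")
  .getOrCreate()
import spark.implicits._

val df = (1 to 10000).toDF("id")

// repartition(1) runs a full shuffle just to end up with one output file.
df.repartition(1).write.mode("overwrite").parquet("/tmp/out-repartitioned")

// coalesce(1) merges the existing partitions into one without a shuffle,
// producing a similar single-file result more cheaply.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/out-coalesced")
```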
For the latter case, if much of the data shares the same values, the change
may lead to heavy data skew and bring a performance regression, as the sketch
after this paragraph illustrates. Currently the best-effort approach we can
choose is to perform a local sort when the data type is comparable (and even
that requires a lot of work refactoring the `ExternalSorter`).
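A hedged illustration of the skew concern (the partition count, column name,
and data are invented for this sketch; hash partitioning is triggered here by
repartitioning on a column, to mimic the effect of the proposed change):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("skew-demo")
  .getOrCreate()
import spark.implicits._

// Many rows carry the same value, so any hash-based placement keyed on
// the row contents sends them all to one partition.
val df = Seq.fill(100000)("same-value").toDF("col")

// Round-robin repartition: rows spread roughly evenly across 8 partitions.
df.repartition(8).rdd.glom().map(_.length).collect().foreach(println)

// Hash repartition on the column: one partition gets every row, the
// other seven stay empty.
df.repartition(8, $"col").rdd.glom().map(_.length).collect().foreach(println)
```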
Another approach is to let the `ShuffleBlockFetcherIterator` remember the
sequence of block fetches and force the blocks to be fetched one by one. This
involves more issues, because we may hit the memory limit and therefore have
to spill the fetched blocks. IIUC this should resolve the issue for the
general case.
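A rough, non-Spark sketch of the fetch-ordering idea (the `BlockId` shape and
`fetch` callback below are illustrative assumptions, not the actual
`ShuffleBlockFetcherIterator` API):

```scala
// Illustrative only: fix an order over the shuffle blocks and fetch them
// strictly one by one, so the reducer consumes rows in the same sequence
// on every task attempt, making the repartition output deterministic.
case class BlockId(mapId: Int, reduceId: Int)

def fetchInFixedOrder(
    blocks: Seq[BlockId],
    fetch: BlockId => Iterator[Array[Byte]]): Iterator[Array[Byte]] = {
  // Sorting yields a deterministic sequence; the lazy flatMap pulls the
  // next block only after the previous one is exhausted, avoiding the
  // nondeterminism of concurrent fetches completing in arbitrary order.
  blocks.sortBy(b => (b.mapId, b.reduceId)).iterator.flatMap(fetch)
}
```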