Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15730
@brkyvz
Good question about the shuffle data in the sparse case. Here is a rough
analysis (not fully rigorous):
As discussed above, the shuffle data consists of step-1 and step-2. If we keep
the parallelism constant, then increasing `midDimSplitNum` requires decreasing
`ShuffleRowPartitions` and `ShuffleColPartitions`; in that case the step-1
shuffle data always decreases while the step-2 shuffle data always increases.
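As a rough sketch of that trade-off (a simplified dense cost model; the function names, the square `R = C` grid, and the scalar size inputs are all my own illustrative assumptions, not Spark's internals):

```python
# Toy shuffle-cost model for block-matrix multiply, assuming a fixed total
# parallelism P = R * C * S, where R/C are the result's row/col shuffle
# partitions and S is midDimSplitNum. All names are illustrative.

def step1_shuffle(size_a, size_b, r, c):
    # step-1 replicates each left block to the C column groups
    # and each right block to the R row groups
    return size_a * c + size_b * r

def step2_shuffle(result_size, s):
    # step-2 aggregates S partial products per result block
    return result_size * s

def total_shuffle(size_a, size_b, result_size, p, s):
    # with P fixed, raising S shrinks the R x C grid (assume a square grid)
    grid = p // s
    r = c = max(1, int(grid ** 0.5))
    return step1_shuffle(size_a, size_b, r, c) + step2_shuffle(result_size, s)
```

For example, with a parallelism of 16 and equally sized inputs, `s = 4` already beats `s = 1` under this toy model, because the step-1 savings outweigh the step-2 growth.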
Now consider the sparse case. The step-2 shuffle data grows because, when each
pair of a left-matrix row set and a right-matrix column set is multiplied, the
work is split into `midDimSplitNum` parts; each part is multiplied separately,
and all parts must then be aggregated, so the more parts there are to
aggregate, the more data is shuffled. BUT in the sparse case these partial
products are empty with high probability, and an empty part shuffles nothing.
So we can expect that, in the sparse case, the step-2 shuffle data will not
grow much compared with the `midDimSplitNum = 1` case, because most of the
split parts shuffle nothing.
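The "most split parts are empty" argument can be made concrete with a tiny probabilistic model (my own approximation, assuming block nonzeros are independent with the same density `d` in both matrices): a mid-dimension part covering `K / S` block positions produces a non-empty partial product with probability `1 - (1 - d^2)^(K/S)`, so the expected number of non-empty parts per result block is:

```python
def expected_nonempty_parts(density, k_blocks, s):
    """Expected non-empty partial products per result block.

    density  -- probability that a given block position is nonzero
                (assumed independent and identical for both inputs)
    k_blocks -- number of blocks along the middle dimension
    s        -- midDimSplitNum
    """
    # P(one mid index yields a nonzero partial) ~= density ** 2
    # P(a whole part of k_blocks / s indices is empty)
    #   = (1 - density**2) ** (k_blocks / s)
    return s * (1.0 - (1.0 - density ** 2) ** (k_blocks / s))
```

For `density = 1` (dense) the expected count is exactly `s`, i.e. step-2 shuffle grows linearly with the split; for `density = 0.01` and `k_blocks = 1000` it only goes from about 0.095 at `s = 1` to about 0.0995 at `s = 10`, which matches the claim above that most split parts shuffle nothing.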
I am also considering improving the API to make it easier to use: the user
would specify the parallelism, and the algorithm would automatically compute
the optimal `midDimSplitNum`. What do you think?
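A minimal sketch of what that could look like (purely hypothetical names and the same toy cost model as above, not a proposal for the actual Spark API):

```python
def choose_mid_dim_split(parallelism, size_a, size_b, result_size):
    """Pick the midDimSplitNum that minimizes a toy shuffle-cost estimate,
    given a user-specified total parallelism. Illustrative only."""
    best_s, best_cost = 1, float("inf")
    # try every split that divides the parallelism evenly
    for s in range(1, parallelism + 1):
        if parallelism % s:
            continue
        grid = parallelism // s
        r = c = max(1, int(grid ** 0.5))  # assume a square R x C grid
        # step-1 replication cost + step-2 aggregation cost
        cost = size_a * c + size_b * r + result_size * s
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s
```

The real version would of course plug in the actual per-block sizes and a sparsity estimate instead of these scalar stand-ins.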
And thanks for the careful review!