Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
> In aggregation we are replacing a needed shuffle with gathering only the
needed rows from the other partitions.
I don't understand what this means. If we decide that we don't need a
shuffle because the existing partitioning already satisfies the requirement,
I'm not sure why we would still need to gather rows from other partitions. I
think it is simple: if we need rows from other partitions, we shuffle; if not,
we avoid the shuffle.
But I think this is the point where our understanding differs. So, as you
said, it is better to hear others' opinions.
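As a toy illustration of the rule I have in mind (plain Python, not Spark code; `needs_shuffle` and the key names are hypothetical, and it assumes hash partitioning where rows with equal partition keys always land in the same partition):

```python
# Toy model only, not Spark's planner. All names here are hypothetical.
def needs_shuffle(child_partition_keys, required_keys):
    """Return True if rows that agree on required_keys may sit in different partitions."""
    # No known partitioning: rows of one group can be anywhere, so shuffle.
    if not child_partition_keys:
        return True
    # If the child is hash-partitioned on a subset of the required keys,
    # rows equal on all required keys are already co-located: no shuffle.
    return not set(child_partition_keys) <= set(required_keys)


if __name__ == "__main__":
    print(needs_shuffle(["a"], ["a", "b"]))       # partitioned on a subset of the grouping keys
    print(needs_shuffle(["a", "c"], ["a", "b"]))  # partitioned on a key outside the grouping keys
```

In this model there is no middle case: either the partitioning satisfies the requirement and no rows move, or it does not and we shuffle.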
> Probably we can wait for others' opinion, but it would be also great to
have some performance tests on both cases and different scenarios in order to
better evaluate this change. What do you think?
Yeah, I think so. I will run some performance tests.