Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
To be clear, we now have these issues:
1. Local limit computes all partitions, which means it launches many tasks
even when a small number of tasks would actually be enough.
2. Global limit has a single-partition issue: currently the global limit
shuffles all the data into one partition, so if the limit number is very
large, it causes a performance bottleneck.
It would be ideal to combine the global limit and local limit into one
stage and avoid the shuffle, but for now I cannot find a good solution
(one with no performance regression) that does this without changing Spark
core/scheduler. Your solution is trying to do that, but as I suggested, there
are some cases where the performance may be worse.
@wzhfy 's idea only resolves the single-partition issue; it still shuffles
and still runs the local limit on all partitions, but it does not degrade
performance in those cases compared with the current code path.
> Another issue is, how do you make sure you create a uniform distribution
of the result of local limit. Each local limit can produce different number of
rows.
It uses a special partitioner to do this. Like `row_number` in SQL, the
partitioner gives each row a uniformly distributed partition id, so in the
reduce stage each task handles a nearly equal number of rows.
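The idea above can be sketched as follows. This is a hypothetical illustration, not the actual code from the PR: it assumes a running row number across the local-limit output and assigns partition ids round-robin, so each reduce task receives a nearly equal share of rows no matter how many rows each local limit produced.

```scala
// Hypothetical sketch of a row-number-based uniform partitioner.
// `rowIndex` is a running row number over the local-limit output;
// the returned id cycles 0, 1, ..., numPartitions - 1, 0, ...
object RoundRobinAssign {
  def partitionId(rowIndex: Long, numPartitions: Int): Int =
    (rowIndex % numPartitions).toInt

  def main(args: Array[String]): Unit = {
    val numPartitions = 3
    // 10 rows spread over 3 reduce tasks: partition sizes are 4, 3, 3,
    // i.e. they differ by at most one row.
    val counts = (0L until 10L)
      .map(i => partitionId(i, numPartitions))
      .groupBy(identity)
      .map { case (pid, rows) => pid -> rows.size }
    println(counts.toSeq.sortBy(_._1))
  }
}
```

With a hash partitioner on row contents the reduce-side sizes could be skewed; a row-number-based assignment bounds the size difference between any two reduce tasks to one row.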