Github user scwf commented on the issue:

    https://github.com/apache/spark/pull/16633
  
    To be clear, we currently have these issues:
    1. The local limit computes all partitions, which means it launches many 
tasks, but often a small number of tasks would be enough.
    2. The global limit has a single-partition issue: it shuffles all the 
data to one partition, so if the limit number is very large, it becomes a 
performance bottleneck.
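    Issue 2 can be sketched in plain Scala (illustrative names only, not 
Spark's actual internals): every partition is limited locally, and then all 
surviving rows land on a single reduce partition for the global limit.

```scala
// Hypothetical sketch of the two-phase LIMIT described above.
// Names are illustrative, not Spark's actual operators.
object TwoPhaseLimit {
  // A task runs on EVERY partition, even when a few would suffice (issue 1).
  def localLimit[T](partitions: Seq[Seq[T]], n: Int): Seq[Seq[T]] =
    partitions.map(_.take(n))

  // All locally limited rows are gathered into ONE partition (issue 2),
  // so a large n makes this single reducer the bottleneck.
  def globalLimit[T](locallyLimited: Seq[Seq[T]], n: Int): Seq[T] =
    locallyLimited.flatten.take(n)

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8, 9))
    val local = localLimit(parts, 2)
    println(local)                  // List(List(1, 2), List(4, 5), List(6, 7))
    println(globalLimit(local, 4))  // List(1, 2, 4, 5)
  }
}
```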
    
    It would be ideal to combine the global limit and local limit into one 
stage and avoid the shuffle, but for now I cannot find a good solution (one 
with no performance regression) that does not change Spark core/scheduler. 
Your solution tries to do that, but as I suggested, there are cases where the 
performance may be worse.
    
    @wzhfy 's idea only resolves the single-partition issue; it still 
shuffles and still runs the local limit on all partitions, but it does not 
degrade performance in those cases compared with the current code path.
    
    > Another issue is, how do you make sure you create a uniform distribution 
of the result of local limit. Each local limit can produce different number of 
rows.
    
    It uses a special partitioner to do this. The partitioner, similar to 
`row_number` in SQL, gives each row a uniformly distributed partition id, so 
each reduce task handles a roughly equal number of rows.
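    A minimal sketch of such a partitioner (purely illustrative, not the 
PR's actual code): give each row a `row_number`-style global index and take 
it modulo the reducer count, so reducer sizes differ by at most one even 
when the local limits produce uneven outputs.

```scala
// Hypothetical round-robin partitioner sketch; names are illustrative.
object RoundRobinLimitPartitioner {
  // Route each row to partition (globalIndex % numReducers),
  // giving a near-uniform distribution across reducers.
  def assign[T](rows: Seq[T], numReducers: Int): Map[Int, Seq[T]] =
    rows.zipWithIndex
      .groupBy { case (_, i) => i % numReducers }
      .map { case (p, rs) => p -> rs.map(_._1) }

  def main(args: Array[String]): Unit = {
    // 7 rows coming out of uneven local limits, 3 reducers.
    val byPartition = assign(Seq("a", "b", "c", "d", "e", "f", "g"), 3)
    // Partition sizes are 3, 2, 2 -- they differ by at most one.
    println(byPartition.map { case (p, rs) => p -> rs.size }.toSeq.sorted)
  }
}
```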



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
