Github user ZunwenYou commented on the issue:

    https://github.com/apache/spark/pull/17000
  
    Hi, @MLnick
    You are right, sliceAggregate splits an array into smaller chunks before 
shuffle.
    It has three advantages.
    Firstly, the shuffle data is smaller than with treeAggregate during the whole 
transformation operation.
    Secondly, as you describe, it allows more concurrency, not only during 
the collect operation on the driver, but also while running **_seqOp_** and 
**_combOp_**.
    Thirdly, as I observed, when a record is larger than 1 Gbit (an array of 
100 million dimensions), shuffling among executors becomes less efficient. At the 
same time, the rest of the executors sit idle. I am not clear on the reason for this.


