Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15730
@brkyvz
Good question about the shuffle data in the sparse case. Here is a rough
analysis (not fully rigorous):
As discussed above, the shuffle data consists of step-1 and step-2. If we keep
the parallelism constant, then increasing `midDimSplitNum` requires decreasing
`ShuffleRowPartitions` and `ShuffleColPartitions`; in that case the step-1
shuffle data always decreases while the step-2 shuffle data always increases.
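As a rough sketch of that trade-off (a simplified dense cost model; the function names, the square `R = C` grid, and the scalar size inputs are all my own illustrative assumptions, not Spark's internals):

```python
# Toy shuffle-cost model for block-matrix multiply, assuming a fixed total
# parallelism P = R * C * S, where R/C are the result's row/col shuffle
# partitions and S is midDimSplitNum. All names are illustrative.

def step1_shuffle(size_a, size_b, r, c):
    # step-1 replicates each left block to the C column groups
    # and each right block to the R row groups
    return size_a * c + size_b * r

def step2_shuffle(result_size, s):
    # step-2 aggregates S partial products per result block
    return result_size * s

def total_shuffle(size_a, size_b, result_size, p, s):
    # with P fixed, raising S shrinks the R x C grid (assume a square grid)
    grid = p // s
    r = c = max(1, int(grid ** 0.5))
    return step1_shuffle(size_a, size_b, r, c) + step2_shuffle(result_size, s)
```

For example, with a parallelism of 16 and equally sized inputs, `s = 4` already beats `s = 1` under this toy model, because the step-1 savings outweigh the step-2 growth.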
Now consider the sparse case. The step-2 shuffle data grows because, when each
pair of a left-matrix row set and a right-matrix column set is multiplied, the
work is split into `midDimSplitNum` parts; each part is multiplied separately,
and all parts must then be aggregated, so the more parts there are to
aggregate, the more data is shuffled. BUT in the sparse case these partial
products are empty with high probability, and an empty part shuffles nothing.
So we can expect that, in the sparse case, the step-2 shuffle data will not
grow much compared with the `midDimSplitNum = 1` case, because most of the
split parts shuffle nothing.
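The "most split parts are empty" argument can be made concrete with a tiny probabilistic model (my own approximation, assuming block nonzeros are independent with the same density `d` in both matrices): a mid-dimension part covering `K / S` block positions produces a non-empty partial product with probability `1 - (1 - d^2)^(K/S)`, so the expected number of non-empty parts per result block is:

```python
def expected_nonempty_parts(density, k_blocks, s):
    """Expected non-empty partial products per result block.

    density  -- probability that a given block position is nonzero
                (assumed independent and identical for both inputs)
    k_blocks -- number of blocks along the middle dimension
    s        -- midDimSplitNum
    """
    # P(one mid index yields a nonzero partial) ~= density ** 2
    # P(a whole part of k_blocks / s indices is empty)
    #   = (1 - density**2) ** (k_blocks / s)
    return s * (1.0 - (1.0 - density ** 2) ** (k_blocks / s))
```

For `density = 1` (dense) the expected count is exactly `s`, i.e. step-2 shuffle grows linearly with the split; for `density = 0.01` and `k_blocks = 1000` it only goes from about 0.095 at `s = 1` to about 0.0995 at `s = 10`, which matches the claim above that most split parts shuffle nothing.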
I am also considering improving the API to make it easier to use: the user
would specify the parallelism, and the algorithm would automatically compute
the optimal `midDimSplitNum`. What do you think?
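A minimal sketch of what that could look like (purely hypothetical names and the same toy cost model as above, not a proposal for the actual Spark API):

```python
def choose_mid_dim_split(parallelism, size_a, size_b, result_size):
    """Pick the midDimSplitNum that minimizes a toy shuffle-cost estimate,
    given a user-specified total parallelism. Illustrative only."""
    best_s, best_cost = 1, float("inf")
    # try every split that divides the parallelism evenly
    for s in range(1, parallelism + 1):
        if parallelism % s:
            continue
        grid = parallelism // s
        r = c = max(1, int(grid ** 0.5))  # assume a square R x C grid
        # step-1 replication cost + step-2 aggregation cost
        cost = size_a * c + size_b * r + result_size * s
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s
```

The real version would of course plug in the actual per-block sizes and a sparsity estimate instead of these scalar stand-ins.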
And thanks for the careful review!