Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen Yes, the code logic here looks confusing, but I think it is correct.
    Let me explain it more clearly.
    In essence, the logic can be expressed as the following chain:
    A0->I1->A1->I2->A2->...
    A0 is the initial `assignments`, I1 is the step-1 `indices`, A1 is the step-1 `assignments`, I2 is the step-2 `indices`, and so on.
    Each item depends on the one before it, as the arrows show.
    The key point is that when we compute I(K), I(K-1) must still be persisted, because I(K) is computed from A(K-1), which in turn depends on I(K-1). I(K-2) and older ones can be unpersisted.
    Now check my code: in each iteration, I do the following:
    1. Unpersist I(K-1).
    2. Compute I(K+1) from A(K). Because of the dependency, A(K) must use I(K), and I(K) is STILL PERSISTED.
    3. Compute A(K+1) from I(K+1).
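
    The three steps above can be sketched with a toy model. This is only an illustration in plain Python, not Spark code: `Node`, `recompute_chain`, and `run` are hypothetical names, and the model assumes A0 is already cached. It records which nodes would have to be evaluated each time a new `indices` step is computed.

```python
class Node:
    """Toy stand-in for an RDD: a name, one parent, and a persisted flag."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False

    def recompute_chain(self):
        """Names of the nodes evaluated when this node is first materialized.

        Walks up the lineage and stops at the first persisted ancestor
        (or at the root if nothing is persisted).
        """
        chain = [self.name]
        node = self.parent
        while node is not None and not node.persisted:
            chain.append(node.name)
            node = node.parent
        return chain


def run(iterations):
    """Simulate the loop and log the work needed for each new I(K+1)."""
    a = Node("A0").persist()  # assumption: the initial assignments are cached
    prev_i = None             # I(K-1)
    cur_i = None              # I(K)
    log = []
    for k in range(iterations):
        # 1. Unpersist I(K-1); nothing computed from here on reads it.
        if prev_i is not None:
            prev_i.unpersist()
        # 2. Compute I(K+1) from A(K); A(K) reads I(K), which is still persisted.
        new_i = Node(f"I{k + 1}", parent=a).persist()
        log.append(new_i.recompute_chain())
        # 3. Compute A(K+1) from I(K+1).
        a = Node(f"A{k + 1}", parent=new_i)
        prev_i, cur_i = cur_i, new_i
    return log
```

    In this model each new I(K+1) is evaluated through A(K) only and stops at the persisted I(K), so unpersisting I(K-1) first never forces a recomputation further up the chain.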
    
    But I have now found another problem in BisectingKMeans:
    the iteration at line 191 also needs this pattern: “persist the current step's RDD, unpersist the previous one”,
    and the first RDD of that iteration is related to the code here.
    The problem seems a little troublesome, so would it be better to create another PR to handle it?
    
    



