Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/14333
@srowen Yes, the logic here looks confusing, but I think it is correct.
Let me explain it more clearly.
In essence, the logic can be expressed as the following dependency chain:
A0->I1->A1->I2->A2->...
A0 is the initial `assignments`, I1 is the step-1 `indices`, A1 is the step-1
`assignments`, I2 is the step-2 `indices`, and so on.
Each arrow is a dependency: the item on the right is computed from the item
on its left.
The key point is that when we compute I(K), I(K-1) must still be persisted,
while I(K-2) and older ones can be unpersisted.
Now check my code. In each iteration, I do the following:
1. Unpersist I(K-1).
2. Compute I(K+1) using A(K). Because of the dependency, A(K) is computed
from I(K), and I(K) is STILL PERSISTED.
3. Compute A(K+1) using I(K+1).
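
To make the three steps above concrete, here is a rough sketch in plain Python. It is only a toy model, not the code in the PR and not the Spark API: `FakeRDD` and `run_iterations` are hypothetical names, and the class tracks nothing but a name and a persisted flag. It shows that at most two `indices` are persisted at any time, and that I(K) is still persisted when I(K+1) is computed.

```python
class FakeRDD:
    """Hypothetical stand-in for a Spark RDD: only a name, a parent, and a persisted flag."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self


def run_iterations(num_steps):
    """Run the loop and return, per step, the names of the indices still persisted."""
    assignments = FakeRDD("A0")  # A0: the initial assignments
    prev_indices = None          # I(K-1)
    indices = None               # I(K)
    all_indices = []
    snapshots = []
    for k in range(num_steps):
        # 1. I(K-1) is no longer needed, so unpersist it.
        if prev_indices is not None:
            prev_indices.unpersist()
        # 2. Compute I(K+1) from A(K). A(K) depends on I(K),
        #    so I(K) must still be persisted at this point.
        if indices is not None:
            assert indices.persisted
        new_indices = FakeRDD(f"I{k + 1}", parent=assignments).persist()
        all_indices.append(new_indices)
        # 3. Compute A(K+1) from I(K+1).
        assignments = FakeRDD(f"A{k + 1}", parent=new_indices)
        prev_indices, indices = indices, new_indices
        snapshots.append([r.name for r in all_indices if r.persisted])
    return snapshots


if __name__ == "__main__":
    for step, names in enumerate(run_iterations(4)):
        print(step, names)
```

In this model each step's snapshot holds only the newest `indices` and the one before it; everything older has been unpersisted.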
But I have now found another problem in BisectingKMeans:
the iteration at line 191 also needs this pattern ("persist the current
step's RDD, unpersist the previous one"),
and the first RDD of that iteration is related to the code here.
The problem seems a little troublesome, so should we create another PR to
handle it?