Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/20629
So when you say "second pass over the data" - from looking at this, it seems
like it could do this with just a second map that looks up the predictions
from the already computed cluster centers, not a stage boundary, so it probably
wouldn't be all that expensive given how Spark does pipelining, unless I'm
missing something. A rough sketch of that idea is below.
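
Something like this (just a sketch, not code from this PR - the helper name and column names are made up for illustration):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Assigning each point to its nearest already-computed cluster center is a
// plain per-row map, so Spark can pipeline it with upstream work instead of
// forcing a new stage/shuffle.
def addPredictions(data: DataFrame, centers: Array[Vector],
                   featuresCol: String = "features"): DataFrame = {
  val predict = udf { features: Vector =>
    // Index of the closest center by squared Euclidean distance.
    centers.indices.minBy(i => Vectors.sqdist(centers(i), features))
  }
  data.withColumn("prediction", predict(data(featuresCol)))
}
```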
This would mean that we'd have to have people set the cluster centers from
their model when they wanted that type of evaluation, but given that the
evaluator wouldn't be able to recover the cluster centers from a test set that
differed from the training set, I think that would be reasonable (rough usage
sketch below).
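
Roughly what I'd imagine the usage looking like - note `setClusterCenters` is hypothetical here, not an existing `ClusteringEvaluator` param, and `trainingData`/`testData` are assumed DataFrames:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val kmeans = new KMeans().setK(3).setFeaturesCol("features")
val model = kmeans.fit(trainingData)

val evaluator = new ClusteringEvaluator()
  .setClusterCenters(model.clusterCenters)  // hypothetical param, not in the current API
  .setFeaturesCol("features")
  .setPredictionCol("prediction")

// Evaluate on a held-out set the model never saw during training.
val score = evaluator.evaluate(model.transform(testData))
```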
That being said, it's been a while since I've looked at the evaluator code, so
I could be coming out of left field.
---