[
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478437#comment-17478437
]
zhengruifeng commented on SPARK-30661:
--------------------------------------
Recently, I spent some time testing blockified kmeans, applying GEMM to
find the closest cluster.
In short:
1, for sparse datasets, blockifying kmeans still causes a regression in most
cases (the existing impl uses the triangle inequality to skip some distance
computations, whereas the Scala-based sparse BLAS always computes all distances);
2, for dense datasets and small k, blockified kmeans (without native BLAS) is
competitive; with native BLAS, it should be significantly faster than the
existing impl.
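To make the trade-off in the two points above concrete, here is a minimal sketch in plain Python (no BLAS, standard library only). It is not the Spark code: the function names closest_by_gemm and closest_with_triangle_inequality are hypothetical, and the dense matrix product is written as nested loops only to show the arithmetic that a GEMM call would perform. The first function always computes every point-to-center term for a block; the second skips centers that the triangle inequality rules out.

```python
import math


def closest_by_gemm(block, centers):
    """Assign each row of `block` to its nearest center via one matrix product.

    Uses ||x - c||^2 = ||x||^2 - 2 <x, c> + ||c||^2. The <x, c> terms for the
    whole block form the product block * centers^T, which is the GEMM step.
    """
    center_norms = [sum(v * v for v in c) for c in centers]
    # GEMM step (hypothetical stand-in): dots[i][j] = <block[i], centers[j]>.
    # Every (row, center) pair is computed; nothing is skipped.
    dots = [[sum(a * b for a, b in zip(x, c)) for c in centers] for x in block]
    assignments = []
    for row in dots:
        # ||x||^2 is the same for every center, so it cannot change the argmin.
        scores = [center_norms[j] - 2.0 * row[j] for j in range(len(centers))]
        assignments.append(min(range(len(centers)), key=scores.__getitem__))
    return assignments


def closest_with_triangle_inequality(x, centers):
    """Row-wise search that skips centers ruled out by the triangle inequality.

    If d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(x, c_best),
    so c_j cannot beat the current best and its distance is never computed.
    Returns (best_index, number_of_point_to_center_distances_computed).
    """
    # Recomputed on every call for brevity; a real implementation would
    # compute the center-to-center distances once per iteration.
    center_dist = [[math.dist(a, b) for b in centers] for a in centers]
    best, best_d, computed = 0, math.dist(x, centers[0]), 1
    for j in range(1, len(centers)):
        if center_dist[best][j] >= 2.0 * best_d:
            continue  # skipped: c_j is provably no closer than c_best
        d = math.dist(x, centers[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed
```

With well-separated centers, the second function computes far fewer distances per point, which is one plausible reading of why the blockified version regresses when the matrix product itself is not fast (sparse input, no native BLAS).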
So I plan to add a new parameter {{solver}} by making KMeans extend
HasSolver, and to support both training impls, so that end users can switch to
the blockified version.
What do you think about it? [~srowen] [~WeichenXu123] [~mengxr] [~huaxingao]
> KMeans blockify input vectors
> -----------------------------
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Minor
>