[ 
https://issues.apache.org/jira/browse/MATH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Barger updated MATH-1330:
-------------------------------
    Description: 
Currently the *KMeansPlusPlusClusterer* class requires its generic parameter 
*T* to extend the *Clusterable* interface, which is:

{quote}
public interface Clusterable \{

    double[] getPoint();
\}
{quote}
i.e. it returns a dense representation of the clusterable data, which makes 
it impossible to run k-means clustering efficiently on high-dimensional but 
very sparse data. I think it would be much better if the *Clusterable* 
interface returned a *Vector*, allowing the use of *SparseVector*s while 
clustering the data. Of course, the *KMeansPlusPlusClusterer* implementation, 
and I assume the other clustering implementations, would have to be 
refactored accordingly to support this.

  was:
Currently the *KMeansPlusPlusClusterer* class requires its generic parameter 
*T* to extend the *Clusterable* interface, which is:
bq. public interface Clusterable {

    double[] getPoint();
}

i.e. it returns a dense representation of the clusterable data, which makes 
it impossible to run k-means clustering efficiently on high-dimensional but 
very sparse data. I think it would be much better if the *Clusterable* 
interface returned a *Vector*, allowing the use of *SparseVector*s while 
clustering the data. Of course, the *KMeansPlusPlusClusterer* implementation, 
and I assume the other clustering implementations, would have to be 
refactored accordingly to support this.


> KMeans clustering algorithm, doesn't support clustering of sparse input data.
> -----------------------------------------------------------------------------
>
>                 Key: MATH-1330
>                 URL: https://issues.apache.org/jira/browse/MATH-1330
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Artem Barger
>
> Currently the *KMeansPlusPlusClusterer* class requires its generic parameter 
> *T* to extend the *Clusterable* interface, which is:
> {quote}
> public interface Clusterable \{
>     double[] getPoint();
> \}
> {quote}
> i.e. it returns a dense representation of the clusterable data, which makes 
> it impossible to run k-means clustering efficiently on high-dimensional but 
> very sparse data. I think it would be much better if the *Clusterable* 
> interface returned a *Vector*, allowing the use of *SparseVector*s while 
> clustering the data. Of course, the *KMeansPlusPlusClusterer* implementation, 
> and I assume the other clustering implementations, would have to be 
> refactored accordingly to support this.
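
To illustrate why a sparse representation matters here, the following is a
minimal, dependency-free sketch and not the Commons Math API. The names
SparseClusterable, SparsePoint, squaredNorm and squaredDistance are
hypothetical and introduced only for this example. It assumes centroids stay
dense while data points expose only their non-zero coordinates, so that the
squared Euclidean distance to a centroid costs time proportional to the
number of non-zeros once the centroid's squared norm is known.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

final class SparseClusterableSketch {

    /** Hypothetical sparse counterpart of Clusterable (not part of Commons Math). */
    interface SparseClusterable {
        /** Total number of coordinates, including the implicit zeros. */
        int getDimension();

        /** Non-zero coordinates only, keyed by coordinate index. */
        Map<Integer, Double> getNonZeroEntries();
    }

    /** Simple map-backed implementation, used only for illustration. */
    static final class SparsePoint implements SparseClusterable {
        private final int dimension;
        private final Map<Integer, Double> entries;

        SparsePoint(int dimension, Map<Integer, Double> entries) {
            this.dimension = dimension;
            // Defensive copy so later changes to the caller's map are not visible.
            this.entries = Collections.unmodifiableMap(new TreeMap<>(entries));
        }

        @Override
        public int getDimension() {
            return dimension;
        }

        @Override
        public Map<Integer, Double> getNonZeroEntries() {
            return entries;
        }
    }

    /** Squared Euclidean norm of a dense vector; computed once per centroid. */
    static double squaredNorm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return sum;
    }

    /**
     * Squared distance between a sparse point p and a dense centroid c, using
     * ||p - c||^2 = ||c||^2 + sum over non-zero i of (p_i^2 - 2 * p_i * c_i).
     * Only the non-zero coordinates of p are visited.
     */
    static double squaredDistance(SparseClusterable p, double[] centroid,
                                  double centroidNormSq) {
        if (p.getDimension() != centroid.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double d = centroidNormSq;
        for (Map.Entry<Integer, Double> e : p.getNonZeroEntries().entrySet()) {
            double v = e.getValue();
            d += v * v - 2.0 * v * centroid[e.getKey()];
        }
        return d;
    }
}
```

For the point (0, 2, 0, -1, 0), stored as the two entries {1: 2.0, 3: -1.0},
and the centroid (1, 0, 0, 1, 2), this gives 13, the same value as the dense
computation, while visiting two coordinates instead of five. The expanded
formula can lose precision to cancellation when a point lies very close to a
centroid, so a real implementation may need to guard against small negative
results.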



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
