By default, the clustering classes in scikit-learn (e.g., DBSCAN) take a
[num_examples, num_features] array as input, but you can also provide the
distance matrix directly by instantiating the estimator with metric='precomputed':
my_dbscan = DBSCAN(..., metric='precomputed')
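To make that concrete, here is a minimal sketch of fitting DBSCAN on a precomputed cosine distance matrix. The toy array and the eps and min_samples values are assumptions chosen only for illustration; they are not tuned for your data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

# Toy data (assumption): two groups of vectors pointing in similar directions.
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

# Square [num_examples, num_examples] matrix of pairwise cosine distances.
distance_matrix = cosine_distances(X)

# eps and min_samples are illustrative values, not recommendations.
my_dbscan = DBSCAN(eps=0.1, min_samples=2, metric='precomputed')
labels = my_dbscan.fit_predict(distance_matrix)
print(labels)
```

With metric='precomputed', fit and fit_predict expect the square distance matrix rather than the original feature array.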
I'm not sure whether it helps in this particular case (it depends on how many
zero elements you have), but you can also use a sparse matrix in CSR format.
Also, you don't need to loop over the rows to compute the pairwise
distances; you can compute them on the complete array in one call, e.g.,
from sklearn.metrics.pairwise import cosine_distances
from scipy import sparse
# returns a dense [num_examples, num_examples] array
distance_matrix = cosine_distances(sparse.csr_matrix(X))
where X is your "[num_examples, num_features]" array.
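As a rough sanity check, the sketch below compares an explicit row-by-row loop with the single vectorized call. The random array is a stand-in for your data, and using scipy.spatial.distance.cosine inside the loop is an assumption about what a loop-based version might look like, not a reconstruction of your code.

```python
import numpy as np
from scipy import sparse
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_distances

# Stand-in data (assumption): 5 examples with 3 features each.
rng = np.random.RandomState(0)
X = rng.rand(5, 3)

# Loop-based version: one pair of rows at a time.
n = X.shape[0]
looped = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        looped[i, j] = cosine(X[i], X[j])

# Vectorized version: all pairs in a single call on the CSR matrix.
vectorized = cosine_distances(sparse.csr_matrix(X))

# Both approaches should agree up to floating-point error.
print(np.allclose(looped, vectorized))
```

The vectorized call pushes the work into optimized matrix routines, so on a large array it should be considerably faster than the Python loop, although the exact speedup depends on the data.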
> On Feb 12, 2018, at 1:10 PM, prince gosavi <princegosav...@gmail.com> wrote:
> I have generated a cosine distance matrix and would like to apply a clustering
> algorithm to it.
> I would like to know which clustering algorithm suits it best and whether the
> data needs further processing before a model can be applied.
> Also, any performance tips would be welcome, as the matrix takes around 3-4 hrs of processing.
> You can find my code here
> Code for READ ONLY PURPOSE.
> scikit-learn mailing list