Re: [petsc-users] Gather and Broadcast Parallel Vectors in k-means algorithm

Eda Oktay Thu, 25 Jun 2020 08:10:24 -0700

Dear Richard,

>From now on, I am actually trying to write a parallel k-means
algorithm by using petsc routines (I don't have to use petsc but I
believe it will be easier) and I used the algorithm you mentioned
before about finding cluster centroids. However, something is
bothering me:


You said that by using MPI_Allreduce, I should "calculate the global
element wise sum of all the local sums and finally divide by the
number of members of that cluster to get the centroid."

But if I do it this way, I think each cluster centroid will be the
same, right? But when I read the k-means algorithm, I thought that
each cluster centroid should be different since elements in each
cluster are different.

Can you enlighten me?

Thanks!

Eda

Mills, Richard Tran <[email protected]>, 30 Nis 2020 Per, 02:07
tarihinde şunu yazdı:
>
> Hi Eda,
>
> Thanks for your reply. I'm still trying to understand why you say you need to 
> duplicate the row vectors across all processes. When I have implemented 
> parallel k-means, I don't duplicate the row vectors. (This would be very 
> unscalable and largely defeat the point of doing this with MPI parallelism in 
> the first place.)
>
> Earlier in this email thread, you said that you have used Matlab to get 
> cluster IDs for each row vector. Are you trying to then use this information 
> to calculate the cluster centroids from inside your PETSc program? If so, you 
> can do this by having each MPI rank do the following: For cluster i in 0 to 
> (k-1), calculate the element-wise sum of all of the local rows that belong to 
> cluster i, then use MPI_Allreduce() to calculate the global elementwise sum 
> of all the local sums (this array will be replicated across all MPI ranks), 
> and finally divide by the number of members of that cluster to get the 
> centroid. Note that MPI_Allreduce() doesn't work on PETSc objects, but simple 
> arrays, so you'll want to use something like MatGetValues() or MatGetRow() to 
> access the elements of your row vectors.
>
> Let me know if I am misunderstanding what you are aiming to do, or if I am 
> misunderstanding something.
>
> It sounds like you would benefit from having some routines in PETSc to do 
> k-means (or other) clustering, by the way?
>
> Best regards,
> Richard
>
> On 4/29/20 3:47 AM, Eda Oktay wrote:
>
> Dear Richard,
>
> I am trying to use spectral clustering algorithm by using k-means clustering 
> algorithm at some point. I am doing this by producing a matrix consisting of 
> eigenvectors (of the adjacency matrix of the graph that I want to partition), 
> then forming row vectors of this matrix. This is the part that I am using 
> parallel vector. By using the output from k-means, I am trying to cluster 
> these row vectors. To cluster these vectors, I think I need all row vectors 
> in all processes. I wanted to use sequential vectors, however, I couldn't 
> find a different way that I form row vectors of a matrix.
>
> I am trying to use VecScatterCreateToAll, however, since my vector is 
> parallel crated by VecDuplicateVecs, my input is not in correct type, so I 
> get error. I still can't get how can I use this function in parallel vector 
> created by VecDuplicateVecs.
>
> Thank you all for your help.
>
> Eda
>
> Mills, Richard Tran <[email protected]>, 7 Nis 2020 Sal, 01:51 tarihinde şunu 
> yazdı:
>>
>> Hi Eda,
>>
>> I think that you probably want to use VecScatter routines, as Junchao
>> has suggested, instead of the lower level star forest for this. I
>> believe that VecScatterCreateToZero() is what you want for the broadcast
>> problem you describe, in the second part of your question. I'm not sure
>> what you are trying to do in the first part. Taking a parallel vector
>> and then copying its entire contents to a sequential vector residing on
>> each process is not scalable, and a lot of the design that has gone into
>> PETSc is to prevent the user from ever needing to do things like that.
>> Can you please tell us what you intend to do with these sequential vectors?
>>
>> I'm also wondering why, later in your message, you say that you get
>> cluster assignments from Matlab, and then "to cluster row vectors
>> according to this information, all processors need to have all of the
>> row vectors". Do you mean you want to get all of the row vectors copied
>> onto all of the processors so that you can compute the cluster
>> centroids? If so, computing the cluster centroids can be done without
>> copying the row vectors onto all processors if you use a communication
>> operation like MPI_Allreduce().
>>
>> Lastly, let me add that I've done a fair amount of work implementing
>> clustering algorithms on distributed memory parallel machines, but
>> outside of PETSc. I was thinking that I should implement some of these
>> routines using PETSc. I can't get to this immediately, but I'm wondering
>> if you might care to tell me a bit more about the clustering problems
>> you need to solve and how having some support for this in PETSc might
>> (or might not) help.
>>
>> Best regards,
>> Richard
>>
>> On 4/4/20 1:39 AM, Eda Oktay wrote:
>> > Hi all,
>> >
>> > I created a parallel vector UV, by using VecDuplicateVecs since I need
>> > row vectors of a matrix. However, I need the whole vector be in all
>> > processors, which means I need to gather all and broadcast them to all
>> > processors. To gather, I tried to use VecStrideGatherAll:
>> >
>> >   Vec UVG;
>> >   VecStrideGatherAll(UV,UVG,INSERT_VALUES);
>> >   VecView(UVG,PETSC_VIEWER_STDOUT_WORLD);
>> >
>> >  however when I try to view the vector, I get the following error.
>> >
>> > [3]PETSC ERROR: Invalid argument
>> > [3]PETSC ERROR: Wrong type of object: Parameter # 1
>> > [3]PETSC ERROR: See
>> > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>> > [3]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
>> > [3]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a
>> > arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
>> > 11:22:54 2020
>> > [3]PETSC ERROR: Wrong type of object: Parameter # 1
>> > [0]PETSC ERROR: See
>> > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>> > [0]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
>> > [0]PETSC ERROR: ./clustering_son_final_edgecut_without_parmetis on a
>> > arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
>> > 11:22:54 2020
>> > [0]PETSC ERROR: Configure options --download-mpich --download-openblas
>> > --download-slepc --download-metis --download-parmetis --download-chaco
>> > --with-X=1
>> > [0]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
>> > /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
>> > ./clustering_son_final_edgecut_without_parmetis on a
>> > arch-linux2-c-debug named localhost.localdomain by edaoktay Sat Apr  4
>> > 11:22:54 2020
>> > [1]PETSC ERROR: Configure options --download-mpich --download-openblas
>> > --download-slepc --download-metis --download-parmetis --download-chaco
>> > --with-X=1
>> > [1]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
>> > /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
>> > Configure options --download-mpich --download-openblas
>> > --download-slepc --download-metis --download-parmetis --download-chaco
>> > --with-X=1
>> > [3]PETSC ERROR: #1 VecStrideGatherAll() line 646 in
>> > /home/edaoktay/petsc-3.11.1/src/vec/vec/utils/vinv.c
>> >
>> > I couldn't understand why I am getting this error. Is this because of
>> > UV being created by VecDuplicateVecs? How can I solve this problem?
>> >
>> > The other question is broadcasting. After gathering all elements of
>> > the vector UV, I need to broadcast them to all processors. I found
>> > PetscSFBcastBegin. However, I couldn't understand the PetscSF concept
>> > properly. I couldn't adjust my question to the star forest concept.
>> >
>> > My problem is: If I have 4 processors, I create a matrix whose columns
>> > are 4 smallest eigenvectors, say of size 72. Then by defining each row
>> > of this matrix as a vector, I cluster them by using k-means
>> > clustering algorithm. For now, I cluster them by using MATLAB and I
>> > obtain a vector showing which row vector is in which cluster. After
>> > getting this vector, to cluster row vectors according to this
>> > information, all processors need to have all of the row vectors.
>> >
>> > According to this problem, how can I use the star forest concept?
>> >
>> > I will be glad if you can help me about this problem since I don't
>> > have enough knowledge about graph theory. An if you have any idea
>> > about how can I use k-means algorithm in a more practical way, please
>> > let me know.
>> >
>> > Thanks!
>> >
>> > Eda
>
>

Re: [petsc-users] Gather and Broadcast Parallel Vectors in k-means algorithm

Reply via email to