Re: [petsc-users] Rather different matrix product results on multiple processes

Peder Jørgensgaard Olesen via petsc-users Wed, 21 Apr 2021 02:22:54 -0700

Dear Hong


Thank you for your reply.


I have a hunch that the issue goes beyond the minor differences that might 
arise from floating-point computation order, however.


Writing the product matrix to a binary file using MatView() and inspecting the 
output shows very different entries depending on the number of processes. Here 
are the first three rows and columns of the product matrix obtained in a 
sequential run:

2.58348   1.68202   1.66302
1.68202   4.27506   1.91897
1.66302   1.91897   2.70028


- and the corresponding part of the product matrix obtained on one node (40 
processes):

4.43536   2.17261   0.16430
2.17261   4.53224   2.53210
0.16430   2.53210   4.73234


The parallel result is not even close to the sequential one. Trying different 
numbers of processes produces yet different results.


Also, the eigenvectors that I subsequently determine using a SLEPc solver do 
not form a proper basis for the column space of the data matrix, as they must, 
except when the code is run on just a single process. That is hardly a surprise 
given the variability of results indicated above. Forming such a basis is 
central to the intended application, and given that it would need to work on 
rather large data sets, running on a single process is hardly a viable solution.


Best regards

Peder

________________________________
From: Zhang, Hong <[email protected]>
Sent: 19 April 2021 18:34:31
To: [email protected]; Peder Jørgensgaard Olesen
Subject: Re: Rather different matrix product results on multiple processes

Peder,
I tested your code on a linux machine. I got
$ ./acorr_mwe
Data matrix norm: 5.0538e+01
Autocorrelation matrix norm: 1.0473e+03

mpiexec -n 40 ./acorr_mwe -matmattransmult_mpidense_mpidense_via allgatherv (default)
Data matrix norm: 5.0538e+01
Autocorrelation matrix norm: 1.0363e+03

mpiexec -n 20 ./acorr_mwe
Data matrix norm: 5.0538e+01
Autocorrelation matrix norm: 1.0897e+03

mpiexec -n 40 ./acorr_mwe -matmattransmult_mpidense_mpidense_via cyclic
Data matrix norm: 5.0538e+01
Autocorrelation matrix norm: 1.0363e+03

I use the PETSc 'main' branch (same as the latest release). You can remove the 
MatAssemblyBegin/End calls after MatMatTransposeMult():
MatMatTransposeMult(data_mat, data_mat, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &corr_mat);
//ierr = MatAssemblyBegin(corr_mat, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
//ierr = MatAssemblyEnd(corr_mat, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

The communication patterns of the parallel implementation lead to a different 
order of floating-point operations, and thus to a slightly different matrix 
norm of R.
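
One way to gauge how large such reordering effects are is to sum the same numbers in two different orders. The sketch below is plain C and does not use PETSc; the names sum_forward and sum_chunked are hypothetical, and the chunked version only mimics a reduction over several processes. It is not how PETSc performs the product internally.

```c
#include <stddef.h>

/* Sum x[0..n-1] strictly left to right, as a single process would. */
double sum_forward(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Sum the same values as nchunks contiguous partial sums that are then
 * combined, mimicking a reduction over nchunks processes.
 * Assumes nchunks >= 1. */
double sum_chunked(const double *x, size_t n, size_t nchunks)
{
    double total = 0.0;
    size_t base = n / nchunks; /* minimum chunk length */
    size_t rem = n % nchunks;  /* the first rem chunks get one extra item */
    size_t start = 0;
    for (size_t c = 0; c < nchunks; ++c) {
        size_t len = base + (c < rem ? 1 : 0);
        double partial = 0.0;
        for (size_t i = start; i < start + len; ++i)
            partial += x[i];
        total += partial;
        start += len;
    }
    return total;
}
```

Comparing sum_forward(x, n) with sum_chunked(x, n, 40) on the same array shows the difference that the change in summation order alone introduces, which can then be set against the differences in the norms reported above.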
Hong

________________________________
From: petsc-users <[email protected]> on behalf of Peder 
Jørgensgaard Olesen via petsc-users <[email protected]>
Sent: Monday, April 19, 2021 7:57 AM
To: [email protected] <[email protected]>
Subject: [petsc-users] Rather different matrix product results on multiple 
processes


Hello,


When computing a matrix product of the type R = D D^T using 
MatMatTransposeMult() I find I get rather different results depending on the 
number of processes. In one example using a data set that is small compared to 
the application I get Frobenius norms |R| = 1.047e3 on a single process, 
1.0363e3 on a single HPC node (40 cores), and 9.7307e2 on two nodes.


I have ascertained that the single process result is indeed the correct one 
(i.e., eigenvectors of R form a proper basis for the columns of D), so 
naturally I'd love to be able to reproduce this result across different 
parallel setups. How might I achieve this?


I'm attaching MWE code and the data set used for the example.


Thanks in advance!


Best Regards


Peder Jørgensgaard Olesen

PhD Student, Turbulence Research Lab

Dept. of Mechanical Engineering

Technical University of Denmark

Niels Koppels Allé

Bygning 403, Rum 105

DK-2800 Kgs. Lyngby
