Fande,
  I ran your code with two processes and found the poor performance of 
PetscSortIntWithArrayPair() was due to duplicates.  In particular,  rank 0 has 
array length = 0 and rank 1 has array length = 4,180,070. On rank 1, each 
unique array value has ~95 duplicates; The duplicates are already clustered 
together before sorting.
  The old quick sort algorithm has an O(n^2) complexity on these duplicates. I 
added a three-way-partition algorithm to split the array into <, =, >pivot 
parts and get these numbers:

Master:
$ time mpirun -n 2 ./ex_petsc_only
real 0m55.359s
user 1m7.807s
sys 0m42.651s

Master:
$ time mpirun -n 2 ./ex_petsc_only -matstash_legacy
real 0m0.987s
user 0m1.565s
sys 0m0.285s

Three way partition
$ time mpirun -n 2 ./ex_petsc_only
real 0m1.015s
user 0m1.535s
sys 0m0.392s

We can see the new sort algorithm gives a 55x speedup and can almost catch up 
with -matstash_legacy. So I think it is good to have no matter whether we will 
do hashing or not in mat assembly.
Please try branch jczhang/feature-better-quicksort-pivot on your side to see if 
it helps and if the segfault persist.
Thanks.

--Junchao Zhang


On Tue, Jul 9, 2019 at 10:18 AM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
Hi Junchao,

If you are struggling on building the right package environment, here is a 
native PETSc example.


Fande,

On Mon, Jul 8, 2019 at 3:00 PM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
I guess John has a pure PETSc example,

John, could you share the pure PETSc example with Junchao?


Fande,

On Mon, Jul 8, 2019 at 2:58 PM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
Yes, here it is https://github.com/fdkong/matrixsparsity


You need to follow instructions here to install MOOSE 
https://www.mooseframework.org/getting_started/installation/mac_os.html


Thanks for your help.


Fande



On Mon, Jul 8, 2019 at 2:28 PM Zhang, Junchao 
<[email protected]<mailto:[email protected]>> wrote:
Is the code public for me to test?
--Junchao Zhang


On Mon, Jul 8, 2019 at 3:06 PM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Junchao,

Tried your code. I did not hit seg fault this time, but the assembly was still 
slow


time mpirun -n 2 ./matrix_sparsity-opt   -matstash_legacy
Close matrix for np = 2 ...
Matrix successfully closed

real 0m2.009s
user 0m3.324s
sys 0m0.575s




 time mpirun -n 2 ./matrix_sparsity-opt
Close matrix for np = 2 ...
Matrix successfully closed

real 3m39.235s
user 6m42.184s
sys 0m35.084s




Fande,




On Mon, Jul 8, 2019 at 8:47 AM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
Will let you know soon.

Thanks,

Fande,

On Mon, Jul 8, 2019 at 8:41 AM Zhang, Junchao 
<[email protected]<mailto:[email protected]>> wrote:
Fande or John,
  Could any of you have a try? Thanks
--Junchao Zhang


---------- Forwarded message ---------
From: Junchao Zhang <[email protected]<mailto:[email protected]>>
Date: Thu, Jul 4, 2019 at 8:21 AM
Subject: Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly
To: Fande Kong <[email protected]<mailto:[email protected]>>


Fande,
  I wrote tests but could not reproduce the error. I pushed a commit that 
changed the MEDIAN macro to a function to make it easier to debug.  Could you 
run and debug it again? It should be easy to see what is wrong in gdb.
  Thanks.
--Junchao Zhang


On Wed, Jul 3, 2019 at 6:48 PM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
Process 3915 resuming
Process 3915 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=2, address=0x7ffee9b91fc8)
    frame #0: 0x000000010cbaa031 
libpetsc.3.011.dylib`PetscSortIntWithArrayPair_Private(L=0x0000000119fc5480, 
J=0x000000011bfaa480, K=0x000000011ff74480, right=13291) at sorti.c:298
   295      }
   296      PetscFunctionReturn(0);
   297    }
-> 298    i    = MEDIAN(L,right);
   299    SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   300    vl   = L[0];
   301    last = 0;
(lldb)


On Wed, Jul 3, 2019 at 4:32 PM Zhang, Junchao 
<[email protected]<mailto:[email protected]>> wrote:
Could you debug it or paste the stack trace? Since it is a segfault, it should 
be easy.
--Junchao Zhang


On Wed, Jul 3, 2019 at 5:16 PM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Junchao,

But there is still segment fault. I guess you could write some continuous 
integers to test your changes.


Fande

On Wed, Jul 3, 2019 at 12:57 PM Zhang, Junchao 
<[email protected]<mailto:[email protected]>> wrote:
Fande and John,
  Could you try jczhang/feature-better-quicksort-pivot? It passed Jenkins tests 
and I could not imagine why it failed on yours.
  Hash table has its own cost. We'd better get quicksort right and see how it 
performs before rewriting code.
--Junchao Zhang


On Tue, Jul 2, 2019 at 2:37 PM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 
(signal 11)

Segmentation fault :-)


As Jed said, it might be a good idea to rewrite the code using the hashing 
table.


Fande,


On Tue, Jul 2, 2019 at 1:27 PM Zhang, Junchao 
<[email protected]<mailto:[email protected]>> wrote:
Try this to see if it helps:

diff --git a/src/sys/utils/sorti.c b/src/sys/utils/sorti.c
index 1b07205a..90779891 100644
--- a/src/sys/utils/sorti.c
+++ b/src/sys/utils/sorti.c
@@ -294,7 +294,8 @@ static PetscErrorCode 
PetscSortIntWithArrayPair_Private(PetscInt *L,PetscInt *J,
     }
     PetscFunctionReturn(0);
   }
-  SWAP3(L[0],L[right/2],J[0],J[right/2],K[0],K[right/2],tmp);
+  i = MEDIAN(L,right);
+  SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   vl   = L[0];
   last = 0;
   for (i=1; i<=right; i++) {


On Tue, Jul 2, 2019 at 12:14 PM Fande Kong via petsc-dev 
<[email protected]<mailto:[email protected]>> wrote:
BTW,

PetscSortIntWithArrayPair is used in MatStashSortCompress_Private.

Any way to avoid to use PetscSortIntWithArrayPair in 
MatStashSortCompress_Private?

Fande,

On Tue, Jul 2, 2019 at 11:09 AM Fande Kong 
<[email protected]<mailto:[email protected]>> wrote:
Hi Developers,

John just noticed that the matrix assembly was slow when having sufficient 
amount of off-diagonal entries. It was not a MPI issue since I was  able to 
reproduce the issue using two cores on my desktop, that is, "mpirun -n 2".

I turned  on a profiling, and 99.99% of the time was spent on 
PetscSortIntWithArrayPair (recursively calling).   It took THREE MINUTES  to 
get the assembly done. And then changed to use the option "-matstash_legacy" to 
restore
the code to the old assembly routine, and the same code took ONE SECOND to get 
the matrix assembly done.

Should write any better sorting algorithms?


Fande,

Reply via email to