Here are the times for KSPSolve on one node with 2,280,285 equations. These nodes seem to have 42 cores: there are 6 "devices" (GPUs), with 7 cores attached to each device. The anomalous 28-core result could be from using only 4 "devices". I figure I will use 36 cores for now. I should really do this with a lot of processors so as to include MPI communication...
NP    KSPSolve
20    5.6634e+00
24    4.7382e+00
28    6.0349e+00
32    4.7543e+00
36    4.2574e+00
42    4.2022e+00
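
To put the numbers above in perspective, here is a minimal Python sketch that computes speedup and parallel efficiency relative to the 20-process run. The choice of NP = 20 as the baseline and the function name `scaling` are assumptions made for illustration, and the times are taken to be in seconds; nothing here comes from PETSc itself.

```python
# KSPSolve times copied from the timings above, keyed by number of MPI processes.
# Units are assumed to be seconds.
times = {20: 5.6634, 24: 4.7382, 28: 6.0349, 32: 4.7543, 36: 4.2574, 42: 4.2022}


def scaling(times, base_np=20):
    """Return (NP, time, speedup, efficiency) rows relative to the base_np run.

    speedup    = T(base_np) / T(NP)
    efficiency = speedup * base_np / NP  (1.0 would be ideal strong scaling)
    """
    base = times[base_np]
    rows = []
    for nproc, t in sorted(times.items()):
        speedup = base / t
        efficiency = speedup * base_np / nproc
        rows.append((nproc, t, speedup, efficiency))
    return rows


rows = scaling(times)
print(f"{'NP':>4} {'time':>8} {'speedup':>8} {'eff':>6}")
for nproc, t, speedup, efficiency in rows:
    print(f"{nproc:>4} {t:>8.4f} {speedup:>8.3f} {efficiency:>6.3f}")
```

On these numbers the 28-process run is the only one slower than the baseline, and going from 36 to 42 processes changes the solve time by only about 1%, which fits with settling on 36 cores for now.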