For comparison with Dr. Guyer's investigation into the effect of DEBUG on execution times: in our Trilinos 9.0.3 build (with MPI and no debug flags that I can find), the ratio of the Trilinos PCG solve time (no preconditioner) to the PySparse solve time is around 3 for the 1000x1000 example. In the serial build Dr. Guyer just reported on, that ratio (for the 500x500 case, if I read it correctly) is about 4.2 (37.7/8.9). Given the vagaries of timing such things and the difference between the examples, these two ratios may well be the same.
On our system, there is no significant difference in single-processor execution times between the MPI and serial builds. There are no explicit debug flags in our config files (from that long-ago, antediluvian world before CMake), and I haven't yet checked what the default was.
