Barry,

That sucks. I am sure that it is not a single-processor machine,
although I've not yet opened it up to check for sure ;) It is
dual-booted with Windows, and I am going to use the Intel performance
counters to find the bandwidth limit under Windows/Linux. I also
found a benchmark for Ubuntu after a bit of searching around and will
see whether it can provide more details. Here are the links for the
benchmarks:

http://software.intel.com/en-us/articles/intel-performance-counter-monitor/
http://manpages.ubuntu.com/manpages/maverick/lmbench.8.html

Hopefully the numbers from Windows and Ubuntu will match; if not,
maybe my Ubuntu configuration needs a bit of tweaking to get this
right. I will keep you updated if I find something interesting.
Thanks for all the helpful comments!

Vijay

On Thu, Feb 3, 2011 at 4:46 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> Based on these numbers (that is, assuming these numbers are a correct
> accounting of how much memory bandwidth you can get from the system*) you
> essentially have a one-processor machine that they sold to you as an 8-processor
> machine for sparse matrix computation. The one-core run is using almost all the
> memory bandwidth; adding more cores to the computation helps very little because
> it is completely starved for memory bandwidth.
>
>   Barry
>
> * perhaps something in the OS is not configured correctly and thus not 
> allowing access to all the memory bandwidth, but this seems unlikely.
>
> On Feb 3, 2011, at 4:29 PM, Vijay S. Mahadevan wrote:
>
>> Barry,
>>
>> The outputs are attached. I do not see a big difference from the
>> earlier results as you mentioned.
>>
>> Let me know if a similar benchmark exists that might help.
>>
>> Vijay
>>
>> On Thu, Feb 3, 2011 at 4:00 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>   Hmm, just running the basic version with mpiexec -n 2 processes isn't
>>> that useful because there is nothing to make sure they are both running at
>>> exactly the same time.
>>>
>>>   I've attached a new version of BasicVersion.c that attempts to
>>> synchronize the operations in the two processes using MPI_Barrier(); it is
>>> probably not a great way to do it, but better than nothing. Please try that
>>> one.
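
For readers without the attachment, here is a minimal sketch of the kind of
barrier-synchronized measurement described above. This is not the actual
BasicVersion.c; the array size and the single triad kernel are illustrative
assumptions, and a real benchmark would repeat the timing and keep the best run.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 2000000   /* per-process array length, matching the STREAM runs in this thread */

int main(int argc, char **argv)
{
  double *a, *b, *c, t, tmax;
  int    i, rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  a = (double *) malloc(N * sizeof(double));
  b = (double *) malloc(N * sizeof(double));
  c = (double *) malloc(N * sizeof(double));
  for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

  /* Line the processes up so they hit the memory bus at the same time */
  MPI_Barrier(MPI_COMM_WORLD);
  t = MPI_Wtime();
  for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];   /* STREAM triad kernel */
  t = MPI_Wtime() - t;
  /* (a real benchmark repeats this ~50 times, takes the best time, and
     guards against the compiler eliding the unused result) */

  /* The slowest process bounds the aggregate rate */
  MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (!rank)
    printf("Aggregate triad rate: %.1f MB/s\n",
           1.0e-6 * (double) size * 3.0 * N * sizeof(double) / tmax);

  free(a); free(b); free(c);
  MPI_Finalize();
  return 0;
}

Built with mpicc and run under mpiexec -n 1 and -n 2, the slowest rank's time
bounds the aggregate rate, which is what matters when both ranks are competing
for the same memory bus.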
>>>
>>>    Thanks
>>>
>>>
>>>   Barry
>>>
>>>
>>> On Feb 3, 2011, at 1:41 PM, Vijay S. Mahadevan wrote:
>>>
>>>> Barry,
>>>>
>>>> Thanks for the quick reply. I ran the benchmark/stream/BasicVersion
>>>> test for one and two processes, and the outputs are as follows:
>>>>
>>>> -n 1
>>>> -------------------------------------------------------------
>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>> -------------------------------------------------------------
>>>> Array size = 2000000, Offset = 0
>>>> Total memory required = 45.8 MB.
>>>> Each test is run 50 times, but only
>>>> the *best* time for each is used.
>>>> -------------------------------------------------------------
>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>> Each test below will take on the order of 2529 microseconds.
>>>>    (= 2529 clock ticks)
>>>> Increase the size of the arrays if this shows that
>>>> you are not getting at least 20 clock ticks per test.
>>>> -------------------------------------------------------------
>>>> WARNING -- The above is only a rough guideline.
>>>> For best results, please be sure you know the
>>>> precision of your system timer.
>>>> -------------------------------------------------------------
>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>> Copy:       10161.8510       0.0032       0.0031       0.0037
>>>> Scale:       9843.6177       0.0034       0.0033       0.0038
>>>> Add:        10656.7114       0.0046       0.0045       0.0053
>>>> Triad:      10799.0448       0.0046       0.0044       0.0054
>>>>
>>>> -n 2
>>>> -------------------------------------------------------------
>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>> -------------------------------------------------------------
>>>> Array size = 2000000, Offset = 0
>>>> Total memory required = 45.8 MB.
>>>> Each test is run 50 times, but only
>>>> the *best* time for each is used.
>>>> -------------------------------------------------------------
>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>> Each test below will take on the order of 4320 microseconds.
>>>>    (= 4320 clock ticks)
>>>> Increase the size of the arrays if this shows that
>>>> you are not getting at least 20 clock ticks per test.
>>>> -------------------------------------------------------------
>>>> WARNING -- The above is only a rough guideline.
>>>> For best results, please be sure you know the
>>>> precision of your system timer.
>>>> -------------------------------------------------------------
>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>> Copy:        5739.9704       0.0058       0.0056       0.0063
>>>> Scale:       5839.3617       0.0058       0.0055       0.0062
>>>> Add:         6116.9323       0.0081       0.0078       0.0085
>>>> Triad:       6021.0722       0.0084       0.0080       0.0088
>>>> -------------------------------------------------------------
>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>> -------------------------------------------------------------
>>>> Array size = 2000000, Offset = 0
>>>> Total memory required = 45.8 MB.
>>>> Each test is run 50 times, but only
>>>> the *best* time for each is used.
>>>> -------------------------------------------------------------
>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>> Each test below will take on the order of 2954 microseconds.
>>>>    (= 2954 clock ticks)
>>>> Increase the size of the arrays if this shows that
>>>> you are not getting at least 20 clock ticks per test.
>>>> -------------------------------------------------------------
>>>> WARNING -- The above is only a rough guideline.
>>>> For best results, please be sure you know the
>>>> precision of your system timer.
>>>> -------------------------------------------------------------
>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>> Copy:        6091.9448       0.0056       0.0053       0.0061
>>>> Scale:       5501.1775       0.0060       0.0058       0.0062
>>>> Add:         5960.4640       0.0084       0.0081       0.0087
>>>> Triad:       5936.2109       0.0083       0.0081       0.0089
>>>>
>>>> I do not have OpenMP installed, so I am not sure whether that is what you
>>>> meant by two threads. I also closed most of the applications that were
>>>> open before running these tests, so the numbers should hopefully be
>>>> accurate.
>>>>
>>>> Vijay
>>>>
>>>>
>>>> On Thu, Feb 3, 2011 at 1:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>>   Vijay,
>>>>>
>>>>>   Let's just look at a single embarrassingly parallel computation in the
>>>>> run; this computation has NO communication and uses NO MPI and NO
>>>>> synchronization between processes:
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>   1 process
>>>>> VecMAXPY            3898 1.0 1.7074e+01 1.0 3.39e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  29 40  0  0  0  1983
>>>>>
>>>>>   2 processes
>>>>> VecMAXPY            3898 1.0 1.3861e+01 1.0 1.72e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  31 40  0  0  0  2443
>>>>>
>>>>>   The speedup is 1.7074e+01/1.3861e+01 = 2443./1983 = 1.23, which is
>>>>> terrible! Now why would it be so bad? (Remember, you cannot blame MPI.)
>>>>>
>>>>> 1) other processes are running on the machine sucking up memory 
>>>>> bandwidth. Make sure no other compute tasks are running during this time.
>>>>>
>>>>> 2) the single process run is able to use almost all of the hardware 
>>>>> memory bandwidth, so introducing the second process cannot increase the 
>>>>> performance much. This means this machine is terrible for parallelization 
>>>>> of sparse iterative solvers.
>>>>>
>>>>> 3) the machine is somehow misconfigured (beats me how) so that while the 
>>>>> one process job doesn't use more than half of the memory bandwidth, when 
>>>>> two processes are run the second process cannot utilize all that 
>>>>> additional memory bandwidth.
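
To put rough numbers on possibility 2: VecMAXPY (y += sum_i alpha_i * x_i) is
itself almost pure memory traffic. Assuming on the order of 4-6 bytes of vector
data moved per flop (a back-of-the-envelope figure, not something taken from the
log), the measured rates correspond to

  1 process:   ~2.0 GFlop/s x 4-6 bytes/flop  =  ~8-12 GB/s
  2 processes: ~2.4 GFlop/s x 4-6 bytes/flop  =  ~10-15 GB/s

so a single process already appears to be pulling close to the ~10-11 GB/s that
the STREAM runs quoted earlier in the thread report for this machine, leaving
little headroom for a second process.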
>>>>>
>>>>>   In src/benchmarks/streams you can run make test and have it generate a
>>>>> report of how the streams benchmark is able to utilize the memory
>>>>> bandwidth. Run that and send us the output (run with just 2 threads).
>>>>>
>>>>>   Barry
>>>>>
>>>>>
>>>>> On Feb 3, 2011, at 12:05 PM, Vijay S. Mahadevan wrote:
>>>>>
>>>>>> Matt,
>>>>>>
>>>>>> I apologize for the incomplete information. Find attached the
>>>>>> log_summary for all the cases.
>>>>>>
>>>>>> The dual quad-core system has 12 GB of DDR3 SDRAM at 1333 MHz in a
>>>>>> 2x2GB/2x4GB configuration. I do not know how to work out the memory
>>>>>> bandwidth from this information, but if you need anything more, do let
>>>>>> me know.
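
For reference, a rough theoretical ceiling can be read off those specs, assuming
the controller runs the DIMMs in dual-channel mode (an assumption; the mixed
2 GB/4 GB pairing could change this):

  2 channels x 1333 MT/s x 8 bytes/transfer  =  ~21.3 GB/s peak

The ~10-11 GB/s a single STREAM process achieves in the runs quoted earlier in
the thread is a fairly typical single-core fraction of that peak, consistent with
Barry's reading that one process already consumes most of the usable bandwidth.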
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> 
>>>>>> wrote:
>>>>>>> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at 
>>>>>>> gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Barry,
>>>>>>>>
>>>>>>>> Sorry about the delay in the reply. I did not have access to the
>>>>>>>> system to test out what you said, until now.
>>>>>>>>
>>>>>>>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
>>>>>>>> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>>>>>>>
>>>>>>>> processor       time
>>>>>>>> 1               114.2
>>>>>>>> 2                89.45
>>>>>>>> 4                81.01
>>>>>>>
>>>>>>> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything from
>>>>>>> this data.
>>>>>>> 2) Do you know the memory bandwidth characteristics of this machine? That is
>>>>>>> crucial and you cannot begin to understand speedup on it until you do. Please
>>>>>>> look this up.
>>>>>>> 3) Worrying about specifics of the MPI implementation makes no sense 
>>>>>>> until
>>>>>>> the basics are nailed down.
>>>>>>>    Matt
>>>>>>>
>>>>>>>>
>>>>>>>> The scaleup doesn't seem to be optimal, even with two processors. I am
>>>>>>>> wondering if the fault is in the MPI configuration itself. Are these
>>>>>>>> results as you would expect? I can also send you the log_summary for
>>>>>>>> all cases if that will help.
>>>>>>>>
>>>>>>>> Vijay
>>>>>>>>
>>>>>>>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>
>>>>>>>>>> Barry,
>>>>>>>>>>
>>>>>>>>>> I understand what you are saying, but which example/options then is the
>>>>>>>>>> best one for measuring scalability on a multi-core machine? I chose
>>>>>>>>>> the nonlinear diffusion problem specifically because of its inherent
>>>>>>>>>> stiffness, which could probably provide noticeable scalability on a
>>>>>>>>>> multi-core system. From your experience, do you think there is another
>>>>>>>>>> example program that will demonstrate this much more rigorously or
>>>>>>>>>> clearly? Btw, I don't get good speedup even for 2 processes with
>>>>>>>>>> ex20.c, and that was the original motivation for this thread.
>>>>>>>>>
>>>>>>>>>   Did you follow my instructions?
>>>>>>>>>
>>>>>>>>>   Barry
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Satish, I configured with --download-mpich now, without the
>>>>>>>>>> mpich-device. The results are given above. I will try the options
>>>>>>>>>> you provided, although I don't entirely understand what they mean,
>>>>>>>>>> which kinda bugs me. Also, is OpenMPI the preferred implementation on
>>>>>>>>>> Ubuntu?
>>>>>>>>>>
>>>>>>>>>> Vijay
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> 
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>   Ok, everything makes sense. Looks like you are using two-level
>>>>>>>>>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
>>>>>>>>>>> -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid
>>>>>>>>>>> problem redundantly on each process (each process is doing the entire
>>>>>>>>>>> coarse grid solve using LU factorization). The time for the factorization
>>>>>>>>>>> is (in the two-process case)
>>>>>>>>>>>
>>>>>>>>>>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>>>>>>>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>>>>>>>
>>>>>>>>>>> which is 74 percent of the total solve time (and 84 percent of the
>>>>>>>>>>> flops). When three-quarters of the entire run is not parallel at all, you
>>>>>>>>>>> cannot expect much speedup. If you run with -snes_view it will display
>>>>>>>>>>> exactly the solver being used. You cannot expect to understand the
>>>>>>>>>>> performance if you don't understand what the solver is actually doing.
>>>>>>>>>>> Using a 20 by 20 by 20 coarse grid is generally a bad idea since the code
>>>>>>>>>>> spends most of the time there; stick with something like 5 by 5 by 5.
>>>>>>>>>>>
>>>>>>>>>>>   I suggest running with the default grid and -dmmg_nlevels 5; then the
>>>>>>>>>>> time spent in the coarse solve will be a trivial percentage of the run
>>>>>>>>>>> time.
>>>>>>>>>>>
>>>>>>>>>>>   You should get pretty good speedup for 2 processes but not much
>>>>>>>>>>> better speedup for four processes because, as Matt noted, the computation
>>>>>>>>>>> is memory-bandwidth limited;
>>>>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
>>>>>>>>>>> Note also that this is running multigrid, which is a fast solver but
>>>>>>>>>>> doesn't scale in parallel as well as many slow algorithms. For example, if
>>>>>>>>>>> you run -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
>>>>>>>>>>> processors but crummy speed.
>>>>>>>>>>>
>>>>>>>>>>>   Barry
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Barry,
>>>>>>>>>>>>
>>>>>>>>>>>> Please find attached the patch for the minor change to control the
>>>>>>>>>>>> number of elements from the command line for snes/ex20.c. I know that
>>>>>>>>>>>> this can be achieved with -grid_x etc. from the command line, but I
>>>>>>>>>>>> thought this just made the typing for the refinement process a little
>>>>>>>>>>>> easier. I apologize if there was any confusion.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, find attached the full log summaries for -np=1 and -np=2.
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   We need all the information from -log_summary to see what is
>>>>>>>>>>>>> going on.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Not sure what -grid 20 means, but don't expect any good parallel
>>>>>>>>>>>>> performance with fewer than at least 10,000 unknowns per process.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's the performance statistic on 1 and 2 processor runs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20
>>>>>>>>>>>>>> -log_summary
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                          Max       Max/Min        Avg      Total
>>>>>>>>>>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>>>>>>>>>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>>>>>>>>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>>>>>>>>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>>>>>>>>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>> MPI Reductions:       4.440e+02      1.00000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20
>>>>>>>>>>>>>> -log_summary
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                          Max       Max/Min        Avg      Total
>>>>>>>>>>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>>>>>>>>>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>>>>>>>>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>>>>>>>>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>>>>>>>>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>>>>>>>>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>>>>>>>>>>>> MPI Reductions:       1.046e+03      1.00000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not entirely sure I can make sense of those statistics, but
>>>>>>>>>>>>>> if there is something more you need, please feel free to let me
>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at 
>>>>>>>>>>>>>> gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan
>>>>>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matt,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The --with-debugging=1 option is certainly not meant for performance
>>>>>>>>>>>>>>>> studies, but I didn't expect it to yield the same CPU time as a single
>>>>>>>>>>>>>>>> processor for snes/ex20, i.e., my runs with 1 and 2 processors take
>>>>>>>>>>>>>>>> approximately the same amount of time to compute the solution. But
>>>>>>>>>>>>>>>> I am currently configuring without debugging symbols and shall let you
>>>>>>>>>>>>>>>> know what that yields.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On a similar note, is there something extra that needs to be done to
>>>>>>>>>>>>>>>> make use of multi-core machines while using MPI? I am not sure if
>>>>>>>>>>>>>>>> this is even related to PETSc, but it could be an MPI configuration
>>>>>>>>>>>>>>>> option that either I or the configure process is missing. All ideas
>>>>>>>>>>>>>>>> are much appreciated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation.
>>>>>>>>>>>>>>> On most
>>>>>>>>>>>>>>> cheap multicore machines, there is a single memory bus, and thus
>>>>>>>>>>>>>>> using more
>>>>>>>>>>>>>>> cores gains you very little extra performance. I still suspect 
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>> are not
>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>> running in parallel, because you usually see a small speedup. 
>>>>>>>>>>>>>>> That
>>>>>>>>>>>>>>> is why I
>>>>>>>>>>>>>>> suggested looking at -log_summary since it tells you how many
>>>>>>>>>>>>>>> processes were
>>>>>>>>>>>>>>> run and breaks down the time.
>>>>>>>>>>>>>>>    Matt
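
A back-of-the-envelope illustration of that memory-bandwidth point, assuming a
CSR matrix with 8-byte values and 4-byte column indices (PETSc's AIJ format is
CSR-like, but the exact byte counts here are an assumption): each nonzero needs
about 12 bytes of matrix data for 2 flops, so with the ~10 GB/s of sustained
bandwidth the STREAM runs elsewhere in the thread report,

  (10 GB/s) / (12 bytes per nonzero) x (2 flops per nonzero)  =  ~1.7 GFlop/s

is roughly the MatMult ceiling for the whole socket, no matter how many cores
take part. (Vector and row-pointer traffic are ignored, so the real ceiling is
a bit lower.)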
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley
>>>>>>>>>>>>>>>> <knepley at gmail.com> wrote:
>>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan
>>>>>>>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am trying to configure my PETSc install with an MPI installation to
>>>>>>>>>>>>>>>>>> make use of a dual quad-core desktop system running Ubuntu. But
>>>>>>>>>>>>>>>>>> even though the configure/make process went through without problems,
>>>>>>>>>>>>>>>>>> the scalability of the programs doesn't seem to reflect what I
>>>>>>>>>>>>>>>>>> expected. My configure options are
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/
>>>>>>>>>>>>>>>>>> --download-mpich=1
>>>>>>>>>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1
>>>>>>>>>>>>>>>>>> --download-hypre=1
>>>>>>>>>>>>>>>>>> --download-blacs=1 --download-scalapack=1 
>>>>>>>>>>>>>>>>>> --with-clanguage=C++
>>>>>>>>>>>>>>>>>> --download-plapack=1 --download-mumps=1 
>>>>>>>>>>>>>>>>>> --download-umfpack=yes
>>>>>>>>>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) For performance studies, make a build using
>>>>>>>>>>>>>>>>> --with-debugging=0
>>>>>>>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there something else that needs to be done as part of the configure
>>>>>>>>>>>>>>>>>> process to enable decent scaling? I am only comparing runs with
>>>>>>>>>>>>>>>>>> mpiexec (-n 1) and (-n 2), but they seem to be taking approximately the
>>>>>>>>>>>>>>>>>> same time, as noted from -log_summary. If it helps, I've been testing
>>>>>>>>>>>>>>>>>> with snes/examples/tutorials/ex20.c for all purposes, with a custom
>>>>>>>>>>>>>>>>>> -grid parameter from the command line to control the number of
>>>>>>>>>>>>>>>>>> unknowns.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If there is something you've witnessed before in this
>>>>>>>>>>>>>>>>>> configuration or
>>>>>>>>>>>>>>>>>> if you need anything else to analyze the problem, do let me
>>>>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin 
>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>>>> is infinitely more interesting than any results to which their
>>>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>>>> lead.
>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>> is infinitely more interesting than any results to which their
>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>> lead.
>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their 
>>>>>>> experiments
>>>>>>> is infinitely more interesting than any results to which their 
>>>>>>> experiments
>>>>>>> lead.
>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>> <ex20_np1.out><ex20_np2.out><ex20_np4.out>
>>>>>
>>>>>
>>>
>>>
>>>
>> <basicversion_np1.out><basicversion_np2.out>
>
>
