Re: [bug #23618] queuing system for multi processors is not well designed.

Edward d'Auvergne Mon, 08 Jun 2015 08:34:50 -0700

Hi Troels,

Please see below:



On 27 May 2015 at 02:10, Troels E. Linnet
<no-reply.invalid-addr...@gna.org> wrote:
> URL:
>   <http://gna.org/bugs/?23618>
>
>                  Summary: queuing system for multi processors is not well
> designed.
>                  Project: relax
>             Submitted by: tlinnet
>             Submitted on: Wed 27 May 2015 12:10:57 AM UTC
>                 Category: relax's source code
> Specific analysis category: None
>                 Priority: 5 - Normal
>                 Severity: 3 - Normal
>                   Status: None
>              Assigned to: None
>          Originator Name:
>         Originator Email:
>              Open/Closed: Open
>                  Release: Repository: trunk
>          Discussion Lock: Any
>         Operating System: All systems
>
>     _______________________________________________________
>
> Details:
>
> The queuing system for multi processors does not appear to be well designed.
>
> This was detected in a dispersion analysis: a clustered fit of 74 spins
> with 100 Monte Carlo simulations.
>
> The test used 10 multi processors, with 1 CPU as the master.
>
> The problem seems to reside in:
> multi.processor.run_queue()
> multi.multi_processor.chunk_queue()
>
> The current queuing system takes the 100 Monte Carlo simulations, chunks
> them up into pieces of 10, and distributes one chunk to each CPU.
>
> Each CPU thus has 10 simulations to handle.
>
> The problem is that the simulations are not all equally fast to solve.
> Thus a CPU will "hang" until all simulations have finished.
> This "blocks" the possibility of assigning CPU power to other tasks until
> all simulations have finished.
>
> A suggestion for a "first" fix is not to chunk up the queue,
> but to let each simulation be handled independently.
>
> In multi/processor.py
> --------------
> -        lqueue = self.chunk_queue(self.command_queue)
> -        self.run_command_queue(lqueue)
> +        #lqueue = self.chunk_queue(self.command_queue)
> +        self.run_command_queue(self.command_queue)
> -------------
>
> This does not seem to improve the timing much, but gives a better overview
> of the process.

This is actually a balancing act which depends on the data transfer
rate between the nodes and the per-node computation time.  For
applications where data transfer is rate limiting (either the data
transfer is slow, or the calculations are relatively fast), the
chunking is very useful.  This is the case for the per-residue level
parallelisation of model-free analyses.
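
As a rough illustration of this trade-off, here is a minimal sketch.
It is not relax code.  The function name, the fixed per-message
latency, and all of the numbers are hypothetical, and the model
assumes that every message costs the same to send regardless of its
size.

```python
def overhead_fraction(n_tasks, task_time, latency, chunk_size):
    """Fraction of the total worker time spent on message transfer.

    Hypothetical model: one message per chunk, each costing a fixed
    'latency', and every task taking exactly 'task_time' to compute.
    """
    n_messages = -(-n_tasks // chunk_size)  # Ceiling division.
    transfer = n_messages * latency
    compute = n_tasks * task_time
    return transfer / (transfer + compute)


# Fast calculations (per-residue like): transfer dominates unless chunked.
fast_unchunked = overhead_fraction(100, task_time=0.01, latency=0.05, chunk_size=1)
fast_chunked = overhead_fraction(100, task_time=0.01, latency=0.05, chunk_size=10)

# Slow calculations (clustered Monte Carlo like): chunking barely matters.
slow_unchunked = overhead_fraction(100, task_time=10.0, latency=0.05, chunk_size=1)
slow_chunked = overhead_fraction(100, task_time=10.0, latency=0.05, chunk_size=10)

print(fast_unchunked, fast_chunked, slow_unchunked, slow_chunked)
```

With these made-up numbers the fast case loses most of its time to
transfer when unchunked, whereas for the slow case the transfer cost
is negligible either way.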


> It appears that the queuing system can be enhanced even further.
> The "Running set" list is not replenished until all jobs in the "Running set"
> have completed.

This is not what I remember as happening.  I remember clearly seeing
the queue being replenished.  Maybe a bug has been introduced.  Or
maybe this new bug is specific to the parallelisation of Monte Carlo
simulations, and not the other parallelisations.  We need to get to
the bottom of this.
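
To make the difference between the two behaviours concrete, here is a
small simulation sketch.  It does not use the relax multi package, the
function names are hypothetical, the task durations are invented, and
data transfer costs are ignored entirely.  It only compares a running
set which is refilled once all of its jobs are done against a queue
which hands out the next job as soon as any worker is free.

```python
import heapq


def makespan_batched(durations, n_workers):
    """Wall time if the running set is only refilled once it is empty."""
    total = 0.0
    for i in range(0, len(durations), n_workers):
        # Each batch lasts as long as its slowest job.
        total += max(durations[i:i + n_workers])
    return total


def makespan_replenished(durations, n_workers):
    """Wall time if a free worker immediately takes the next queued job."""
    free_at = [0.0] * n_workers  # Time at which each worker becomes free.
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # The earliest available worker.
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Invented durations: two slow jobs among six fast ones, on 4 workers.
durations = [5, 1, 1, 1, 5, 1, 1, 1]
batched = makespan_batched(durations, 4)
replenished = makespan_replenished(durations, 4)
print(batched, replenished)
```

If the queue really is being replenished, the timings should look more
like the second function than the first.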


> This influences the solving time.
>
>
> ----
> Only 20 Monte Carlo simulations are run for comparison.
> /usr/bin/time -p relax_multi bug.py
>
> The running time for 1 CPU, no multi processor:
> real 510.94
> user 5903.01
> sys 133.96
>
> The running time for 1 CPU, 4 multi processor:
> real 214.89
> user 1786.39
> sys 37.09
>
> The running time for 1 CPU, 10 multi processor:
> real 108.39
> user 1930.21
> sys 44.45
>
>
> The running time for 1 CPU, 4 multi processor with first fix:
> real 235.46
> user 1892.20
> sys 38.58
>
> The running time for 1 CPU, 10 multi processor with first fix:
> real 110.50
> user 1957.99
> sys 43.60

What is the 'relax_multi' file?  The times with the fix look to be the
same.  I don't believe that this change is a fix though, and you
should probably revert it.  The increase in 'sys' time from 4 to 10
processors might be due to data transfer being a bottleneck.  However,
I cannot check this yet, as I don't know how to execute the 'bug.py'
script ;)
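
For reference, the numbers quoted above can be compared with a few
lines of Python.  This is only a sketch.  The dictionary keys are my
reading of the labels as the number of multi processors, and the
efficiency simply divides by that count, ignoring that one processor
may be acting as the master.

```python
# Wall-clock ('real') times in seconds, copied from the report above.
real = {1: 510.94, 4: 214.89, 10: 108.39}
real_with_change = {4: 235.46, 10: 110.50}


def speedup(n):
    """Speedup of the 'real' time relative to the single processor run."""
    return real[1] / real[n]


def efficiency(n):
    """Speedup per processor (1.0 would be perfect scaling)."""
    return speedup(n) / n


for n in (4, 10):
    # Relative change in 'real' time caused by disabling the chunking.
    change = real_with_change[n] / real[n] - 1.0
    print(f"{n:2d} processors: speedup {speedup(n):.2f}, "
          f"efficiency {efficiency(n):.2f}, "
          f"'real' time change with the modification {change:+.1%}")
```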

Cheers,

Edward

_______________________________________________
relax (http://www.nmr-relax.com)

This is the relax-devel mailing list
relax-devel@gna.org

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel
