I tweaked your scripts so that I could run several different variations on
my cluster.
I have lots of jobs queued up (I have 29 nodes left in my cluster -- 3 have
died over time); they'll take a while to execute.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3204131 jenkins alltoall jsquyres PD 0:00 8 (Resources)
3204132 jenkins alltoall jsquyres PD 0:00 8 (Resources)
3204133 jenkins barrier jsquyres PD 0:00 8 (Resources)
3204134 jenkins bcast jsquyres PD 0:00 8 (Resources)
3204135 jenkins gather jsquyres PD 0:00 8 (Resources)
3204136 jenkins reduce jsquyres PD 0:00 8 (Resources)
3204137 jenkins reduce_s jsquyres PD 0:00 8 (Resources)
3204138 jenkins reduce_s jsquyres PD 0:00 8 (Resources)
3204139 jenkins scatter jsquyres PD 0:00 8 (Resources)
3204140 jenkins allgathe jsquyres PD 0:00 8 (Resources)
3204141 jenkins allgathe jsquyres PD 0:00 8 (Resources)
3204142 jenkins allreduc jsquyres PD 0:00 8 (Resources)
3204143 jenkins alltoall jsquyres PD 0:00 8 (Resources)
3204144 jenkins alltoall jsquyres PD 0:00 8 (Resources)
3204145 jenkins barrier jsquyres PD 0:00 8 (Resources)
3204146 jenkins bcast jsquyres PD 0:00 8 (Resources)
3204147 jenkins gather jsquyres PD 0:00 8 (Resources)
3204148 jenkins reduce jsquyres PD 0:00 8 (Resources)
3204149 jenkins reduce_s jsquyres PD 0:00 8 (Resources)
3204150 jenkins reduce_s jsquyres PD 0:00 8 (Resources)
3204151 jenkins scatter jsquyres PD 0:00 8 (Resources)
3204152 jenkins allgathe jsquyres PD 0:00 16 (Resources)
3204153 jenkins allgathe jsquyres PD 0:00 16 (Resources)
3204154 jenkins allreduc jsquyres PD 0:00 16 (Resources)
3204155 jenkins alltoall jsquyres PD 0:00 16 (Resources)
3204156 jenkins alltoall jsquyres PD 0:00 16 (Resources)
3204157 jenkins barrier jsquyres PD 0:00 16 (Resources)
3204158 jenkins bcast jsquyres PD 0:00 16 (Resources)
3204159 jenkins gather jsquyres PD 0:00 16 (Resources)
3204160 jenkins reduce jsquyres PD 0:00 16 (Resources)
3204161 jenkins reduce_s jsquyres PD 0:00 16 (Resources)
3204162 jenkins reduce_s jsquyres PD 0:00 16 (Resources)
3204163 jenkins scatter jsquyres PD 0:00 16 (Resources)
3204164 jenkins allgathe jsquyres PD 0:00 16 (Resources)
3204165 jenkins allgathe jsquyres PD 0:00 16 (Resources)
3204166 jenkins allreduc jsquyres PD 0:00 16 (Resources)
3204167 jenkins alltoall jsquyres PD 0:00 16 (Resources)
3204168 jenkins alltoall jsquyres PD 0:00 16 (Resources)
3204169 jenkins barrier jsquyres PD 0:00 16 (Resources)
3204170 jenkins bcast jsquyres PD 0:00 16 (Resources)
3204171 jenkins gather jsquyres PD 0:00 16 (Resources)
3204172 jenkins reduce jsquyres PD 0:00 16 (Resources)
3204173 jenkins reduce_s jsquyres PD 0:00 16 (Resources)
3204174 jenkins reduce_s jsquyres PD 0:00 16 (Resources)
3204175 jenkins scatter jsquyres PD 0:00 16 (Resources)
3204176 jenkins allgathe jsquyres PD 0:00 16 (Resources)
3204177 jenkins allgathe jsquyres PD 0:00 16 (Resources)
3204178 jenkins allreduc jsquyres PD 0:00 16 (Resources)
3204179 jenkins alltoall jsquyres PD 0:00 16 (Resources)
3204180 jenkins alltoall jsquyres PD 0:00 16 (Resources)
3204181 jenkins barrier jsquyres PD 0:00 16 (Resources)
3204182 jenkins bcast jsquyres PD 0:00 16 (Resources)
3204183 jenkins gather jsquyres PD 0:00 16 (Resources)
3204184 jenkins reduce jsquyres PD 0:00 16 (Resources)
3204185 jenkins reduce_s jsquyres PD 0:00 16 (Resources)
3204186 jenkins reduce_s jsquyres PD 0:00 16 (Resources)
3204187 jenkins scatter jsquyres PD 0:00 16 (Resources)
3204188 jenkins allgathe jsquyres PD 0:00 29 (Resources)
3204189 jenkins allgathe jsquyres PD 0:00 29 (Resources)
3204190 jenkins allreduc jsquyres PD 0:00 29 (Resources)
3204191 jenkins alltoall jsquyres PD 0:00 29 (Resources)
3204192 jenkins alltoall jsquyres PD 0:00 29 (Resources)
3204193 jenkins barrier jsquyres PD 0:00 29 (Resources)
3204194 jenkins bcast jsquyres PD 0:00 29 (Resources)
3204195 jenkins gather jsquyres PD 0:00 29 (Resources)
3204196 jenkins reduce jsquyres PD 0:00 29 (Resources)
3204197 jenkins reduce_s jsquyres PD 0:00 29 (Resources)
3204198 jenkins reduce_s jsquyres PD 0:00 29 (Resources)
3204199 jenkins scatter jsquyres PD 0:00 29 (Resources)
3204200 jenkins allgathe jsquyres PD 0:00 29 (Resources)
3204201 jenkins allgathe jsquyres PD 0:00 29 (Resources)
3204202 jenkins allreduc jsquyres PD 0:00 29 (Resources)
3204203 jenkins alltoall jsquyres PD 0:00 29 (Resources)
3204204 jenkins alltoall jsquyres PD 0:00 29 (Resources)
3204205 jenkins barrier jsquyres PD 0:00 29 (Resources)
3204206 jenkins bcast jsquyres PD 0:00 29 (Resources)
3204207 jenkins gather jsquyres PD 0:00 29 (Resources)
3204208 jenkins reduce jsquyres PD 0:00 29 (Resources)
3204209 jenkins reduce_s jsquyres PD 0:00 29 (Resources)
3204210 jenkins reduce_s jsquyres PD 0:00 29 (Resources)
3204211 jenkins scatter jsquyres PD 0:00 29 (Resources)
3204212 jenkins allgathe jsquyres PD 0:00 29 (Resources)
3204213 jenkins allgathe jsquyres PD 0:00 29 (Resources)
3204214 jenkins allreduc jsquyres PD 0:00 29 (Resources)
3204215 jenkins alltoall jsquyres PD 0:00 29 (Resources)
3204216 jenkins alltoall jsquyres PD 0:00 29 (Resources)
3204217 jenkins barrier jsquyres PD 0:00 29 (Resources)
3204218 jenkins bcast jsquyres PD 0:00 29 (Resources)
3204219 jenkins gather jsquyres PD 0:00 29 (Resources)
3204220 jenkins reduce jsquyres PD 0:00 29 (Resources)
3204221 jenkins reduce_s jsquyres PD 0:00 29 (Resources)
3204222 jenkins reduce_s jsquyres PD 0:00 29 (Resources)
3204223 jenkins scatter jsquyres PD 0:00 29 (Resources)
3204128 jenkins allgathe jsquyres R 5:10 8 mpi[004-011]
3204129 jenkins allgathe jsquyres R 5:10 8 mpi[016-023]
3204130 jenkins allreduc jsquyres R 5:10 8 mpi[024-031]
> On Apr 13, 2020, at 6:35 PM, Zhang, William via devel
> <[email protected]> wrote:
>
> Hello all,
>
> I have created a --with-slurm option when running (See updated README). In
> order to set new defaults for collective algorithms, we will need data from
> those who wish to provide it. We have created the following package that
> allows for collecting data:
> https://github.com/open-mpi/ompi-collectives-tuning
>
> Please run the package as soon as possible. Details on how to run are in the
> README.md. If data collection fails, the output of the analyze script (either
> analyze.sh.o* for SGE or the output of ./run_and_analyze if using slurm) will
> report "Error parsing <filename>. Data format doesn't match. Exiting..".
> Please make sure data collection succeeds and a decision file is written
> entirely.
>
> Please provide me with either the output directory or if it’s inconvenient to
> share this data, provide me a list of optimal switchover points at different
> message sizes for each algorithm (This can be in the form of the
> output/decision.file which only contains switchover points and no specific
> performance numbers)
>
> Thanks,
> William Zhang
--
Jeff Squyres
[email protected]