Dear Wei,

Not a lot of information to go on there... e.g. the layout of the MPI processes on 
the compute nodes, the interconnect and the GPFS settings... but the standout 
information appears to be:

"10X slower than local SSD, and nfs reexport of another gpfs filesystem"

"The per process IO is very slow, 4-5 MiB/s, while on ssd and nfs I got 20-40 
MiB/s"

You also note 2 GB/s performance for 4 MB writes, and 1.7 GB/s for reads. That is 
only about 500 IOPS (2 GB/s divided by 4 MB per transfer); I assume you'd see more 
IOPS with 4 kB reads/writes.

I'd also note that 10x slower is kind of an intermediate number: it's bad, but 
not totally unproductive.

I think the likely issues are going to be around the GPFS (client) config, 
although you might also be struggling with IOPS. The fact that the NFS 
re-export trick works (allowing O/S-level lazy caching and instant re-opening 
of files) suggests that total performance is not your issue. Upping the 
pagepool and/or maxStatCache etc may just make all these issues go away.

If I picked out the right benchmark, then it is the one with a 360-pixel box size, 
which is not too small... I don't know how many files comprise your particle set...

Regards,
Robert
--

Dr Robert Esnouf

University Research Lecturer,
Director of Research Computing BDI,
Head of Research Computing Core WHG,
NDM Research Computing Strategy Officer

Main office:
Room 10/028, Wellcome Centre for Human Genetics,
Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK

Emails:
[email protected] / [email protected] / [email protected]

Tel: (+44)-1865-287783 (WHG); (+44)-1865-743689 (BDI)
 
----- Original Message -----
From: Guo, Wei ([email protected])
Date: 08/08/19 23:19
To: [email protected], [email protected], 
[email protected], [email protected]
Subject: [gpfsug-discuss] relion software using GPFS storage



Hi Robert and Michael,

What are the settings within Relion for parallel file systems?

Sorry to bump this old thread; I don't see any further conversation on it, and I 
have not been able to rejoin the mailing list recently because of the 
spectrumscale.org:10000 web server error. I used to be on this mailing list with 
my previous work email.

The problem is that I also see Relion 3 does not like GPFS. It is obscenely slow, 
slower than anything else I have tried: local SSD, or an NFS re-export of GPFS. 
I am using the standard benchmarks from the Relion 3 website.

Running mpirun -n 9 `which relion_refine_mpi` on GPFS is 10X slower than on local 
SSD or on an NFS re-export of another GPFS filesystem. With the latter two I get 
run times (1 hr 25 min) close to the published result (1 hr 13 min) on the same 
Intel Xeon Gold 6148 CPU @ 2.40 GHz and 4 V100 GPU cards, with the same command. 
On GPFS, the same standard benchmark takes 15-20 min for one iteration, when it 
should be under 1.7 min.

The per-process I/O is very slow, 4-5 MiB/s, whereas on SSD and NFS I get 20-40 
MiB/s, judging by /proc/<PID>/io of the relion_refine processes.
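
For reference, this is roughly how I watch those per-process rates. It is a 
minimal sketch, assuming the relion_refine ranks run (under my own user) on the 
node where it is executed; it matches processes by their comm name and reports the 
syscall-level rchar/wchar counters from /proc/<PID>/io as MiB/s over a short interval:

#!/usr/bin/env python3
"""Rough per-process I/O rate sampler based on /proc/<pid>/io.

A minimal sketch: matches processes whose comm name contains
'relion_refine' and reports syscall-level rchar/wchar deltas as MiB/s.
"""
import os
import time

INTERVAL = 10  # seconds between the two samples


def relion_pids():
    """PIDs of local processes whose command name contains 'relion_refine'."""
    pids = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/comm") as f:
                    if "relion_refine" in f.read():
                        pids.append(entry)
            except OSError:
                pass  # process exited while we were scanning
    return pids


def io_counters(pid):
    """Return the rchar/wchar counters (bytes read/written via syscalls)."""
    counters = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, _, value = line.partition(":")
            counters[key] = int(value)
    return counters["rchar"], counters["wchar"]


if __name__ == "__main__":
    before = {}
    for pid in relion_pids():
        try:
            before[pid] = io_counters(pid)
        except OSError:
            pass  # process exited before the first sample
    time.sleep(INTERVAL)
    mib = 1024 * 1024
    for pid, (r0, w0) in before.items():
        try:
            r1, w1 = io_counters(pid)
        except OSError:
            continue  # rank finished during the interval
        print(f"pid {pid}: {(r1 - r0) / mib / INTERVAL:.1f} MiB/s read, "
              f"{(w1 - w0) / mib / INTERVAL:.1f} MiB/s write")

(Watching the read_bytes/write_bytes fields instead would count only the I/O that 
actually reaches the storage layer rather than the page cache.)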

My GPFS client can see ~2 GB/s when benchmarking with IOZONE; yes, only 2 GB/s, 
because it is a small system with ~70 drives.

Record Size 4096 kB
O_DIRECT feature enabled
File size set to 20971520 kB
Command line used: iozone -r 4m -I -t 16 -s 20g
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 kBytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 16 processes
Each process writes a 20971520 kByte file in 4096 kByte records

Children see throughput for 16 initial writers = 1960218.38 kB/sec
Parent sees throughput for 16 initial writers = 1938463.07 kB/sec
Min throughput per process =  120415.66 kB/sec 
Max throughput per process =  123652.07 kB/sec
Avg throughput per process =  122513.65 kB/sec
Min xfer = 20426752.00 kB

Children see throughput for 16 readers = 1700354.00 kB/sec
Parent sees throughput for 16 readers = 1700046.71 kB/sec
Min throughput per process =  104587.73 kB/sec 
Max throughput per process =  108182.84 kB/sec
Avg throughput per process =  106272.12 kB/sec
Min xfer = 20275200.00 kB



Using --no_parallel_disk_io is even worse, and --only_do_unfinished_movies does 
not help much.

Please advise.

Thanks,

Wei Guo
Computational Engineer
St Jude Children's Research Hospital
[email protected]


Dear Michael,

There are settings within Relion for parallel file systems; you should check that 
they are enabled if you have Spectrum Scale underneath.

Otherwise, check which version of Relion is in use and then try to understand the 
problem being analysed a little more.

If the box size is very small and the internal symmetry is low, then the user may 
read 100,000s of small "picked particle" files for each iteration, opening and 
closing the files each time.

I believe that Relion 3 has some facility for extracting these small particles 
from the larger raw images, and that is more Spectrum Scale-friendly. 
Alternatively, the size of the set of picked particles is often only in the 50 GB 
range, and so staging it to one or more local machines is quite feasible...
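
If it helps, a quick way to judge whether staging is practical is simply to count 
and size the picked-particle files. Below is a minimal sketch; the default 
"Extract" directory name and the .mrcs extension are assumptions about a typical 
Relion project layout, so point it at wherever the particle stacks actually live:

#!/usr/bin/env python3
"""Count picked-particle files and total their size.

A minimal sketch for judging whether staging to local disk is feasible;
the default directory name and the .mrcs extension are assumptions
about a typical Relion project layout.
"""
import os
import sys


def particle_stats(root):
    """Walk 'root' and return (file_count, total_bytes) for .mrcs stacks."""
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".mrcs"):
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                    count += 1
                except OSError:
                    pass  # file vanished mid-scan
    return count, total


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "Extract"
    count, total = particle_stats(root)
    print(f"{count} particle files, {total / 1024**3:.1f} GiB total")

If that reports a very large file count but only a few tens of GB, copying the set 
onto local SSD on each node before the refinement is probably the easiest win.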

Hope one of those suggestions helps.
Regards,
Robert

--

Dr Robert Esnouf 

University Research Lecturer, 
Director of Research Computing BDI, 
Head of Research Computing Core WHG, 
NDM Research Computing Strategy Officer 

Main office: 
Room 10/028, Wellcome Centre for Human Genetics, 
Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK 

Emails: 
robert at strubi.ox.ac.uk / robert at well.ox.ac.uk / robert.esnouf at bdi.ox.ac.uk 

Tel:   (+44)-1865-287783 (WHG); (+44)-1865-743689 (BDI)
 

-----Original Message-----
From: "Michael Holliday" <michael.holliday at crick.ac.uk>
To: gpfsug-discuss at spectrumscale.org
Date: 27/02/19 12:21
Subject: [gpfsug-discuss] relion software using GPFS storage


Hi All,
 
We've recently had an issue where a job on our client GPFS cluster caused our 
main storage to go extremely slowly. The job was running Relion using MPI 
(https://www2.mrc-lmb.cam.ac.uk/relion/index.php?title=Main_Page).
 
It caused waiters across the cluster, and caused the load to spike on the NSDs 
one at a time. When the spike ended on one NSD, it immediately started on another. 
 
There were no obvious errors in the logs and the issues cleared immediately 
after the job was cancelled. 
 
Has anyone else seen any issues with Relion using GPFS storage?
 
Michael
 
Michael Holliday RITTech MBCS
Senior HPC & Research Data Systems Engineer | eMedLab Operations Team
Scientific Computing STP | The Francis Crick Institute
1, Midland Road | London | NW1 1AT | United Kingdom
Tel: 0203 796 3167
 
The Francis Crick Institute Limited is a registered charity in England and 
Wales no. 1140062 and a company registered in England and Wales no. 06885462, 
with its registered office at 1 Midland Road London NW1 1AT

Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
