On 6/15/19 12:46 AM, Hapla Vaclav wrote:
On 14 Jun 2019, at 21:53, Jakub Kruzik wrote:
The problem is that you need to write the file with an optimal stripe
count/size in the first place. An unaware user who just uses
something like cp will end up with the default stripe count which is
usually 1.
Sure. This is clear I guess. I should add that it can be a bit
challenging to "defeat" the Linux page cache. E.g. writing a file and
reading it right away can result in a ridiculously high read rate, as it
is actually read from RAM :-)
As far as I know, Lustre does not use the Linux page cache (on the
server side). Since version 2.9 it has a server-side cache, but that is
supposed to be used for small files only. You can try
lfs ladvise -a dontneed <file>, but there is no guarantee that a file
that is in the cache will actually be cleared from it. See
http://doc.lustre.org/lustre_manual.xhtml#idm140012896072288
To cope with both issues, I always
1) remove data.striped.h5
2) apply the stripe settings to the not-yet-existing data.striped.h5,
which creates a new data.striped.h5 of zero size
3) copy the original data.h5, stored somewhere else, over to
that data.striped.h5
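
The three steps above could be scripted roughly as follows. This is only a sketch under assumptions: it relies on the `lfs` client tool being on the PATH and on `lfs setstripe -c <count> -S <size> <file>` creating the empty striped file; the function names and the 4 MiB default stripe size are placeholders, not settings taken from this thread.

```python
import os
import shutil
import subprocess


def setstripe_command(dst, stripe_count, stripe_size="4M"):
    """Build the `lfs setstripe` call that creates an empty striped file."""
    return ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, dst]


def restripe_copy(src, dst, stripe_count, stripe_size="4M", dry_run=False):
    """Recreate dst with the requested striping, then copy src into it."""
    cmd = setstripe_command(dst, stripe_count, stripe_size)
    if dry_run:
        # Only report what would be run; touch nothing on disk.
        return cmd
    # 1) remove the old copy so that its old layout is not kept
    if os.path.exists(dst):
        os.remove(dst)
    # 2) create a new zero-size file carrying the stripe settings
    subprocess.run(cmd, check=True)
    # 3) copy the data into the pre-striped file (opens dst for writing,
    #    so the layout set in step 2 is the one being used)
    shutil.copyfile(src, dst)
    return cmd
```

Calling `restripe_copy("data.h5", "data.striped.h5", 12, dry_run=True)` just returns the `lfs` command line, which is a convenient check before running it on a real scratch file system.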
For large files, you should just set the stripe count to the number
of OSTs. Your results seem to support this.
Sure. Would be cool to have some clear limit for "large" ;-) But in
this case it's definitely better to overshoot the number of stripes
than to underestimate it.
Agreed. I would say a large file is of a size where you actually care
how fast you are reading :)
For the small mesh and 64 nodes, you are reading just 2 MiB per
process. I think that collective I/O should give you a significant
improvement.
OK, I'm giving it another shot now that the results with
non-collective I/O look credible. I'm curious about that "significant" ;-)
But even if you are right, it's kind of tricky to say when this toggle
should be turned on, or even to decide it automatically in PETSc...
Note that the default number of aggregators is usually equal to the
number of OSTs (or stripe count?). I would try setting cb_nodes to a
multiple of the number of OSTs close to the number of nodes used.
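
One possible reading of this suggestion, as a small sketch: take the multiple of the OST count that is nearest to the node count and pass it as the `cb_nodes` hint, together with `romio_cb_read` to switch collective buffering on for reads. Both are ROMIO hint names; the helper functions and the 36 OSTs / 64 nodes used in the demo are illustrative assumptions, and whether a given MPI implementation honours or caps these hints is not something this sketch can tell.

```python
def pick_cb_nodes(n_osts, n_nodes):
    """Return the multiple of n_osts closest to n_nodes (at least n_osts)."""
    multiplier = max(1, round(n_nodes / n_osts))
    return multiplier * n_osts


def collective_read_hints(n_osts, n_nodes):
    """Build MPI-IO hints as strings, the form an MPI_Info object expects."""
    return {
        "romio_cb_read": "enable",  # ask for collective buffering on reads
        "cb_nodes": str(pick_cb_nodes(n_osts, n_nodes)),  # aggregator count
    }


if __name__ == "__main__":
    # Hypothetical setup: 36 OSTs, 64 compute nodes.
    print(collective_read_hints(n_osts=36, n_nodes=64))
```

The resulting key/value pairs would still have to be handed to the I/O layer, e.g. through an MPI_Info object; that part is left out here.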
Also, it would be interesting to know what performance you get from a
single process reading from a single OST. I think you should be able
to get 0.5-2.5 GiB/s which is what you are getting from 36 OSTs (~70
MiB/s per OST).
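
The per-OST figure is just the aggregate rate divided by the number of OSTs. A minimal check of that arithmetic, using the upper end of the quoted range (the function name is made up for illustration):

```python
def per_ost_rate_mib(aggregate_gib_per_s, n_osts):
    """Average read rate per OST in MiB/s for a given aggregate rate."""
    return aggregate_gib_per_s * 1024 / n_osts


if __name__ == "__main__":
    # 2.5 GiB/s spread over 36 OSTs, as in the estimate above.
    print(round(per_ost_rate_mib(2.5, 36), 1))
```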
Wait, if you look at the table, it's a bit outdated (from before Atlanta),
sorry for the confusion. The new graphs on slide 18 show a rate of
approx. 10.5/3.5 = 3 GiB/s for the 128M mesh.
Here are graphs showing load time for 3 different stripe counts and
several different CPU counts.
128M elements: https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY
256M elements: https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz
For the 256M one I got up to ~4.5 GiB/s.
It's slowing down with a growing number of CPUs. I wonder whether it
could be further improved, but it's not a big deal for now.
For 12k processes, each process is trying to read less than 2 MiB,
and each OST has more than 340 clients. In this case, you
should read on a subset of processes and then distribute, which is
effectively what collective I/O should do if the settings are correct.
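
To make the numbers concrete, here is a small sketch of the client-per-OST count and of picking a subset of reader ranks. It assumes "12k" means 12288 processes and reuses the 36 OSTs mentioned earlier; one reader per OST is only an illustrative choice, and the actual distribution step (e.g. an MPI scatter) is not shown.

```python
def clients_per_ost(n_procs, n_osts):
    """Average number of processes hitting each OST if all of them read."""
    return n_procs / n_osts


def reader_ranks(n_procs, n_readers):
    """Evenly spaced ranks that would do the actual reads."""
    if not 0 < n_readers <= n_procs:
        raise ValueError("n_readers must be between 1 and n_procs")
    stride = n_procs // n_readers
    return [i * stride for i in range(n_readers)]


if __name__ == "__main__":
    n_procs, n_osts = 12288, 36  # assumed values, see the note above
    print(clients_per_ost(n_procs, n_osts))    # all ranks read directly
    print(reader_ranks(n_procs, n_osts)[:4])   # first few designated readers
```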
BTW, since you also used Salomon for testing, I found some old tests
I did there with pure MPI I/O, and I was able to get 18.5 GiB/s read
for a 1 GiB file on 108 processes / 54 nodes, 54 OSTs, 4 MiB stripe.
OK, but it's probably not a good time to try to reproduce these just
now. The current greeting message:
Planned Salomon /Scratch Maintenance From 2019-06-18 09:00 Till
2019-06-21 13:00
(2019-06-11 08:58:35)
We plan to upgrade Lustre stack. We hope to resolve some performance
issues with SCRATCH.
Thanks,
Vaclav
Best,
Jakub
On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:
I take back one thing I mentioned in my talk in Atlanta. I think I
said that Lustre striping does not really influence the read
performance. With my latest results in hand, I must point out this
is not true. I might have been confused by some former Piz Daint
Lustre performance issues and/or HDF5 library issues I mentioned.
Here are my latest slides from PASC19.
https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS
On slide 18, there is some comparison for different stripe settings.
I can now see a speed-up of ~4 for 1 vs 12 stripes (12 happens to be
the number of cores per node) for the mesh with 128M
elements. The times are very similar for 8 and 64 compute nodes.
Toby, could you maybe forward this message to the meeting attendees?
I don't want to leave anybody confused.
Thanks,
Vaclav