On 14 Jun 2019, at 21:53, Jakub Kruzik wrote:

The problem is that you need to write the file with an optimal stripe 
count/size in the first place. An unaware user who just uses something like cp 
will end up with the default stripe count which is usually 1.

Sure. This is clear, I guess. I should add that it can be a bit challenging to 
"defeat" the Linux page cache. E.g. writing a file and reading it right away 
can result in a ridiculously high read rate, as it is actually read from RAM :-)
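
One way to reduce the page-cache effect from within a benchmark script is to ask the kernel to drop the cached pages of the file before timing the read. The following is a minimal sketch using `posix_fadvise`; the helper name is hypothetical, the call is only a hint to the kernel, and whether it fully evicts data cached by a Lustre client is an assumption that has not been verified here.

```python
import os

def drop_file_from_page_cache(path):
    """Hint the kernel to evict cached pages of `path`.

    Returns True if the hint was issued, False if posix_fadvise is
    not available on this platform. This is advisory only: the kernel
    may keep some pages, and behaviour on Lustre clients is an assumption.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages cannot be dropped, so flush them to storage first.
        os.fsync(fd)
        # offset=0, len=0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True
```

Copying the file in from elsewhere, as described below, sidesteps the issue only if the copy is not itself still cached on the node that later reads it.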

To cope with both issues, I always
1) remove data.striped.h5,
2) apply the stripe settings to the now non-existent data.striped.h5, which 
creates a new data.striped.h5 with zero size,
3) copy the original data.h5, stored somewhere else, over to that 
data.striped.h5.
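
The three steps can be sketched as a small script. This is an illustration under stated assumptions, not the exact commands used for the measurements: the function names are hypothetical, the 4M stripe size is only an example value, and the `lfs setstripe -c <count> -S <size>` options should be checked against the Lustre version installed on the machine.

```python
import os
import shutil
import subprocess

def setstripe_command(path, stripe_count, stripe_size="4M"):
    """Build the `lfs setstripe` command line (stripe size is an example value)."""
    return ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path]

def restripe_copy(src, dst, stripe_count, stripe_size="4M", dry_run=False):
    """Recreate `dst` with the given striping and fill it with the data of `src`.

    With dry_run=True nothing is touched; the command is only returned.
    """
    cmd = setstripe_command(dst, stripe_count, stripe_size)
    if dry_run:
        return cmd
    # 1) remove the old striped file, if any
    if os.path.exists(dst):
        os.remove(dst)
    # 2) create an empty file with the requested layout
    subprocess.run(cmd, check=True)
    # 3) copy the data into the pre-created file; copyfile opens the
    #    existing file for writing rather than replacing it
    shutil.copyfile(src, dst)
    return cmd
```

For example, `restripe_copy("data.h5", "data.striped.h5", 12, dry_run=True)` returns the command that would be run without touching any files.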


For large files, you should just set the stripe count to the number of OSTs. 
Your results seem to support this.

Sure. It would be cool to have some clear limit for "large" ;-) But in this 
case it's definitely better to overshoot the number of stripes than to 
underestimate it.


For the small mesh and 64 nodes, you are reading just 2 MiB per process. I 
think that collective I/O should give you a significant improvement.

OK, I'm giving it another shot now that the results with non-collective I/O 
look credible. I'm curious about that "significant" ;-)

But even if you are right, it's kind of tricky to say when this toggle should 
be turned on, or even to decide it automatically in PETSc...


Also, it would be interesting to know what performance you get from a single 
process reading from a single OST. I think you should be able to get 0.5-2.5 
GiB/s which is what you are getting from 36 OSTs (~70 MiB/s per OST).

Wait, if you look at the table, it's a bit outdated (before Atlanta), sorry for 
the confusion. The new graphs on slide 18 show a rate of approx. 10.5/3.5 = 3 
GiB/s for the 128M mesh.
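
The two rates quoted in this exchange can be checked with a few lines of arithmetic. This is a sketch that assumes 10.5 is the amount of data read in GiB and 3.5 is the load time in seconds (the message gives only the ratio), and that 1 GiB = 1024 MiB; the function names are hypothetical.

```python
MIB_PER_GIB = 1024

def read_rate_gib_s(size_gib, seconds):
    """Aggregate read rate in GiB/s."""
    return size_gib / seconds

def per_ost_mib_s(aggregate_gib_s, n_osts):
    """Average rate per OST in MiB/s for a given aggregate rate."""
    return aggregate_gib_s * MIB_PER_GIB / n_osts

# 128M mesh: assumed 10.5 GiB read in 3.5 s
rate_128m = read_rate_gib_s(10.5, 3.5)

# Upper end of the quoted 0.5-2.5 GiB/s range, spread over 36 OSTs
per_ost = per_ost_mib_s(2.5, 36)

print(f"{rate_128m:.1f} GiB/s, {per_ost:.0f} MiB/s per OST")
```

The second value comes out at roughly 71 MiB/s, consistent with the "~70 MiB/s per OST" quoted above.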

Here are graphs showing load time for 3 different stripe counts and several 
different CPU counts.
128M elements: https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY
256M elements: https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz

For the 256M one I got up to ~4.5 GiB/s.

It's slowing down as the number of CPUs grows. I wonder whether it could be 
further improved, but it's not a big deal for now.


BTW, since you also used Salomon for testing, I found some old tests I did 
there with pure MPI I/O, and I was able to get 18.5 GiB/s read for 1 GiB file 
on 108 processes / 54 nodes, 54 OSTs, 4 MiB stripe.

OK, but it's probably not a good time to try to reproduce these just now. The 
current greeting message:

Planned Salomon /Scratch Maintanance From 2019-06-18 09:00 Till 2019-06-21 13:00
                            (2019-06-11 08:58:35)

We plan to upgrade Lustre stack. We hope to resolve some performance issues
with SCRATCH.


Thanks,
Vaclav


Best,

Jakub


On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:
I take back one thing I mentioned in my talk in Atlanta. I think I said that 
Lustre striping does not really influence the read performance. With my latest 
results in hand, I must point out this is not true. I might have been confused 
by some former Piz Daint Lustre performance issues and/or HDF5 library issues I 
mentioned.

Here are my latest slides from PASC19.
https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS

On slide 18, there is some comparison for different stripe settings. I can now 
see a speed-up of ~4 for 1 vs 12 stripes (which is actually the number of cores 
per node) for the mesh with 128M elements. The times are very similar for 8 and 
64 computation nodes.

Toby, could you maybe forward this message to the meeting attendees? I don't 
want to leave anybody confused.

Thanks,
Vaclav
