Re: [petsc-dev] PETSc Meeting errata

Jakub Kruzik via petsc-dev Sat, 15 Jun 2019 02:55:07 -0700


On 6/15/19 12:46 AM, Hapla Vaclav wrote:

On 14 Jun 2019, at 21:53, Jakub Kruzik wrote:
The problem is that you need to write the file with an optimal stripe count/size in the first place. An unaware user who just uses something like cp will end up with the default stripe count, which is usually 1.
Sure. This is clear, I guess. I should add that it can be a bit challenging to "defeat" the Linux page cache. E.g. writing a file and reading it right away can result in a ridiculously high read rate, as it is actually read from RAM :-)

As far as I know, Lustre does not use the Linux page cache (on the server side). Since version 2.9 it has a server-side cache, but that is supposed to be used for small files only. You can try lfs ladvise -a dontneed <file>, but there is no guarantee that a file in the cache will actually be cleared. See http://doc.lustre.org/lustre_manual.xhtml#idm140012896072288
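For the client-side page cache mentioned above, a minimal Python sketch of a timed read is given here. It assumes a Linux client where os.posix_fadvise is available; the function name timed_read and the 4 MiB block size are illustrative choices, not something from this thread. The POSIX_FADV_DONTNEED call is only a hint to the kernel and does nothing about any server-side cache, so the measured time should still be treated with suspicion.

```python
import os
import time


def timed_read(path, block_size=4 * 1024 * 1024):
    """Read `path` once and return (bytes_read, seconds).

    Before reading, the kernel is asked (advisory only) to drop the
    file's pages from the client page cache.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Not available on every platform, hence the guard.
        if hasattr(os, "posix_fadvise"):
            # offset 0, length 0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        total = 0
        start = time.perf_counter()
        while True:
            chunk = os.read(fd, block_size)
            if not chunk:
                break
            total += len(chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total, elapsed
```

Dividing the returned byte count by the elapsed time gives the read rate; an implausibly high value suggests the data still came from RAM.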

To cope with both issues, I always
1) remove data.striped.h5
2) apply the stripe settings to the non-existing data.striped.h5, which creates a new data.striped.h5 with zero size
3) copy the file over from the original data.h5, stored somewhere else, to that data.striped.h5
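The three steps above can be sketched as a small Python helper that only builds the commands, so they can be inspected before running them on a Lustre client. The source path /archive/data.h5, the 4M stripe size and the helper name restripe_commands are assumptions for illustration; a stripe count of -1 asks lfs setstripe to use all available OSTs.

```python
import shlex


def restripe_commands(source, target, stripe_count=-1, stripe_size="4M"):
    """Return the commands for the remove / setstripe / copy workflow."""
    return [
        # 1) remove any existing target so its old layout is discarded
        ["rm", "-f", target],
        # 2) create an empty target with the requested stripe layout
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, target],
        # 3) copy the data into the pre-striped file
        ["cp", source, target],
    ]


# Hypothetical paths, printed rather than executed.
for cmd in restripe_commands("/archive/data.h5", "data.striped.h5"):
    print(shlex.join(cmd))
```

Each command list could be passed to subprocess.run(cmd, check=True) once the layout looks right.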
For large files, you should just set the stripe count to the number of OSTs. Your results seem to support this.
Sure. Would be cool to have some clear limit for "large" ;-) But in this case it's definitely better to overshoot the number of stripes rather than underestimate it.

Agreed. I would say a large file is of a size where you actually care how fast you are reading :)

For the small mesh and 64 nodes, you are reading just 2 MiB per process. I think that collective I/O should give you a significant improvement.
OK, I'm giving it another shot now that the results with non-collective I/O look credible. I'm curious about that "significant" ;-)
But even if you are right, it's kind of tricky to say when this toggle should be turned on, or even to decide it automatically in PETSc...

Note that the default number of aggregators is usually equal to the number of OSTs (or stripe count?). I would try setting cb_nodes to a multiple of the number of OSTs close to the number of nodes used.
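One possible reading of that suggestion, as a Python sketch: take the largest multiple of the OST count that does not exceed the node count, and fall back to the node count when there are fewer nodes than OSTs. Rounding down rather than to the nearest multiple is an assumption, as is the helper name collective_read_hints. The hint keys cb_nodes and romio_cb_read are ROMIO hint names; whether they are honoured depends on the MPI implementation.

```python
def collective_read_hints(n_osts, n_nodes):
    """Build MPI-IO hints with cb_nodes as a multiple of the OST count."""
    if n_osts < 1 or n_nodes < 1:
        raise ValueError("n_osts and n_nodes must be positive")
    # Largest whole number of "OST groups" that fits into the node count.
    multiples = max(1, n_nodes // n_osts)
    # Never ask for more aggregators than there are nodes.
    cb_nodes = min(n_nodes, multiples * n_osts)
    return {"romio_cb_read": "enable", "cb_nodes": str(cb_nodes)}


print(collective_read_hints(36, 64))  # {'romio_cb_read': 'enable', 'cb_nodes': '36'}
print(collective_read_hints(12, 64))  # {'romio_cb_read': 'enable', 'cb_nodes': '60'}
```

With mpi4py, each key/value pair could be copied into an MPI.Info object via info.Set(key, value) and passed when opening the file.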

Also, it would be interesting to know what performance you get from a single process reading from a single OST. I think you should be able to get 0.5-2.5 GiB/s, which is what you are getting from 36 OSTs (~70 MiB/s per OST).
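The per-OST figure follows from dividing the aggregate rate by the number of OSTs. A short Python check, using the 2.5 GiB/s and 36 OSTs quoted above (the function name is illustrative):

```python
def per_ost_rate_mib(aggregate_gib_per_s, n_osts):
    """Average read rate per OST in MiB/s for a given aggregate rate in GiB/s."""
    return aggregate_gib_per_s * 1024 / n_osts


print(round(per_ost_rate_mib(2.5, 36), 1))  # 71.1
```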
Wait, if you look at the table, it's a bit outdated (from before Atlanta), sorry for the confusion. The new graphs on slide 18 show a rate of approx. 10.5/3.5 = 3 GiB/s for the 128M mesh.
Here are graphs showing load time for 3 different stripe counts and several different CPU counts.
128M elements: https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY
256M elements: https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz

For the 256M one I got up to ~4.5 GiB/s.
It's slowing down with a growing number of CPUs. I wonder whether it could be further improved, but it's not a big deal for now.

For 12k processes, you are trying to read less than 2 MiB by each process, and each OST has more than 340 clients. In this case, you should read on a subset of processes and then distribute, which is effectively what collective I/O should do if the settings are correct.
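A quick Python check of these numbers, assuming "12k" means 12288 processes and the 36 OSTs mentioned earlier in the thread (both are assumptions, as is the helper name):

```python
def read_load(n_processes, n_osts, mib_per_process):
    """Return (clients per OST, upper bound on total data read in GiB)."""
    clients_per_ost = n_processes / n_osts
    total_gib = n_processes * mib_per_process / 1024
    return clients_per_ost, total_gib


clients, total = read_load(12288, 36, 2)
print(round(clients, 1), total)  # 341.3 24.0
```

So every OST serves roughly 341 readers, each asking for under 2 MiB, which is the access pattern that reading on a subset of processes is meant to avoid.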

BTW, since you also used Salomon for testing, I found some old tests I did there with pure MPI I/O, and I was able to get 18.5 GiB/s read for a 1 GiB file on 108 processes / 54 nodes, 54 OSTs, 4 MiB stripe.
OK, but it's probably not a good time to try to reproduce these just now. The current greeting message:
Planned Salomon /Scratch Maintanance From 2019-06-18 09:00 Till 2019-06-21 13:00
                            (2019-06-11 08:58:35)
We plan to upgrade Lustre stack. We hope to resolve some performance issues
with SCRATCH.


Thanks,
Vaclav
Best,

Jakub


On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:
I take back one thing I mentioned in my talk in Atlanta. I think I said that Lustre striping does not really influence the read performance. With my latest results in hand, I must point out this is not true. I might have been confused by some former Piz Daint Lustre performance issues and/or HDF5 library issues I mentioned.
Here are my latest slides from PASC19.
https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS
On slide 18, there is a comparison of different stripe settings. I can now see a speed-up of ~4 for 1 vs 12 stripes (which is actually the number of cores per node) for the mesh with 128M elements. The times are very similar for 8 and 64 computation nodes.
Toby, could you maybe forward this message to the meeting attendees? I don't want to leave anybody confused.
Thanks,
Vaclav
