We wrote a collector for Amazon EFA 
<https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html>, a 
high-speed network interface similar to InfiniBand. 

This interface is used for tightly coupled applications in HPC (WRF, Ansys 
Fluent, GROMACS...) and distributed ML (think LLMs like BLOOM, OPT... or 
diffusion-based models like Stable Diffusion). The metrics are used for 
optimizing and troubleshooting these computational workloads. The 
collector we wrote is based on the InfiniBand collector and also involved 
changes to ProcFS, since EFA metrics are exposed in a similar way.

*Would the team be open to us creating a PR that adds a new collector for 
this network interface?*


