Hi everyone,

I'd like to start a discussion on FLIP-592: First-Class Accelerator Resource Support [1].

As noted in FLIP-577 (AI-Native Flink) [2], with the growth of AI-oriented workloads, accelerators (GPUs, NPUs, TPUs) have become essential resources for Flink jobs. The existing ExternalResource framework (FLIP-108) [3] provides a generic abstraction, but lacks dedicated accelerator APIs and resource allocation strategies optimized for accelerator utilization.

This FLIP proposes elevating accelerators to first-class resources with end-to-end native support. The proposal focuses on:

- Dedicated accelerator resource declaration APIs and configurations, with K8s/YARN deployment integration
- A new resource allocation strategy that supports heterogeneous TM provisioning, isolating CPU-only and accelerator-equipped TMs to improve accelerator utilization
- An SPI-based framework for device discovery and metrics collection, with built-in support for Nvidia GPUs

All new capabilities are optional and fully backward compatible.

Looking forward to your feedback!

Best regards,
Yi Zhang

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-592%3A+First-Class+Accelerator+Resource+Support
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
[3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
