Hi all,
I would like to start a discussion about FLIP-591: Introducing Python
DataFrame API in PyFlink [1].
This FLIP is a sub-FLIP of the broader direction discussed in
FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal Data
Processing) [2]. It proposes a new public Python module,
`pyflink.dataframe`, as a DataFrame-style API on top of the existing
PyFlink Table API. The goal is not to introduce a new execution model,
but to provide a more natural Python-facing entry point for users
coming from the broader Python data ecosystem, while preserving Flink
semantics and execution capabilities.
The proposal focuses on:
- Designing a Python-friendly DataFrame API for PyFlink, including
  the API shape itself, a more user-friendly DataType design, unified
  configuration, reduced TableEnvironment boilerplate, and a practical
  multiple-sink model for end-to-end pipelines
- Providing ergonomic support for row-oriented Python
  transformations, including map / map_batches style operations for
  enrichment, feature engineering, and AI/ML workloads
- Exposing concurrency configuration so that expensive Python
  stages can be scaled independently, making it easier to build
  practical jobs directly with the DataFrame API
- Supporting Arrow as a first-class batch format for efficient
  interoperability with the Python ecosystem
The Design Decisions section discusses the main design considerations
behind the proposal and is a good place to focus when reviewing it.
Looking forward to your feedback!
Regards,
Dian
[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-591%3A+Introducing+Python+DataFrame+API+in+PyFlink
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275