Hi all,

I would like to start a discussion about FLIP-591: Introducing Python
DataFrame API in PyFlink [1].

This FLIP is to a sub-FLIP of the broader direction discussed in
FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal Data
Processing) [2]. This FLIP proposes a new public Python module,
`pyflink.dataframe`, as a DataFrame-style API on top of the existing
PyFlink Table API. The goal is not to introduce a new execution model,
but to provide a more natural Python-facing entry point for users
coming from the broader Python data ecosystem, while preserving Flink
semantics and execution capabilities.

The proposal focuses on:
    - Designing a Python-friendly DataFrame API for PyFlink, including
the API shape itself, a more user-friendly DataType design, unified
configuration, reduced TableEnvironment boilerplate, and a practical
multiple-sink model for end-to-end pipelines
    - Providing ergonomic support for row-oriented Python
transformations, including map / map_batches style operations for
enrichment, feature engineering, and AI/ML workloads
    - Exposing concurrency configuration so that expensive Python
stages can be scaled independently, making it easier to build
practical jobs directly with the DataFrame API
    - Supporting Arrow as a first-class batch format for efficient
interoperability with the Python ecosystem

The Design Decisions section discusses the main design considerations
behind the proposal and may be a useful place to pay extra attention
when reviewing it.

Looking forward to your feedback!

Regards,
Dian

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-591%3A+Introducing+Python+DataFrame+API+in+PyFlink
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275

Reply via email to