Thanks Dian,
There is a lot of good work here that aligns with what we have been
brainstorming for a better PyFlink experience.
One ergonomic suggestion I'd love to see included is supporting pythonic
aliasing via kwargs within `.agg` similar to what is already outlined in
`with_columns` or `select`:
The example would instead look like this:
df.group_by("dept").agg(
avg_salary=col("salary").avg(),
headcount=col("id").count(),
)
Another nice-to-have would be flexibility in column referencing, I see 2
variations scattered throughout the FLIP:
col("age")
df["age"]
Both of these make sense, I think we should also consider supporting attr
style column references since these can be reused across the lambda or
subscript filtering examples already in the FLIP:
df.age
That would then give us this representative example:
df.group_by("dept").agg(
avg_salary=df.salary.avg(),
headcount=df.id.count(),
)
Cheers,
Cole
On Wed, Jun 17, 2026 at 6:12 AM Dian Fu <[email protected]> wrote:
> Hi all,
>
> I would like to start a discussion about FLIP-591: Introducing Python
> DataFrame API in PyFlink [1].
>
> This FLIP is to a sub-FLIP of the broader direction discussed in
> FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal Data
> Processing) [2]. This FLIP proposes a new public Python module,
> `pyflink.dataframe`, as a DataFrame-style API on top of the existing
> PyFlink Table API. The goal is not to introduce a new execution model,
> but to provide a more natural Python-facing entry point for users
> coming from the broader Python data ecosystem, while preserving Flink
> semantics and execution capabilities.
>
> The proposal focuses on:
> - Designing a Python-friendly DataFrame API for PyFlink, including
> the API shape itself, a more user-friendly DataType design, unified
> configuration, reduced TableEnvironment boilerplate, and a practical
> multiple-sink model for end-to-end pipelines
> - Providing ergonomic support for row-oriented Python
> transformations, including map / map_batches style operations for
> enrichment, feature engineering, and AI/ML workloads
> - Exposing concurrency configuration so that expensive Python
> stages can be scaled independently, making it easier to build
> practical jobs directly with the DataFrame API
> - Supporting Arrow as a first-class batch format for efficient
> interoperability with the Python ecosystem
>
> The Design Decisions section discusses the main design considerations
> behind the proposal and may be a useful place to pay extra attention
> when reviewing it.
>
> Looking forward to your feedback!
>
> Regards,
> Dian
>
> [1]
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-591*3A*Introducing*Python*DataFrame*API*in*PyFlink__;JSsrKysrKw!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSizWPE4Qo$
> [2]
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275__;!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSiab5Dp28$
>