Re: [DISCUSS] FLIP-591: Introducing Python DataFrame API in PyFlink

Cole Bailey via dev Wed, 17 Jun 2026 02:41:29 -0700

Thanks Dian,

There is a lot of good work here that aligns with what we have been
brainstorming for a better PyFlink experience.


One ergonomic suggestion I'd love to see included is supporting pythonic
aliasing via kwargs within `.agg` similar to what is already outlined in
`with_columns` or `select`:

The example would instead look like this:

df.group_by("dept").agg(
    avg_salary=col("salary").avg(),
    headcount=col("id").count(),
)


Another nice-to-have would be flexibility in column referencing, I see 2
variations scattered throughout the FLIP:

col("age")

df["age"]


Both of these make sense, I think we should also consider supporting attr
style column references since these can be reused across the lambda or
subscript filtering examples already in the FLIP:

df.age


That would then give us this representative example:

df.group_by("dept").agg(
    avg_salary=df.salary.avg(),
    headcount=df.id.count(),
)


Cheers,
Cole



On Wed, Jun 17, 2026 at 6:12 AM Dian Fu <[email protected]> wrote:

> Hi all,
>
> I would like to start a discussion about FLIP-591: Introducing Python
> DataFrame API in PyFlink [1].
>
> This FLIP is to a sub-FLIP of the broader direction discussed in
> FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal Data
> Processing) [2]. This FLIP proposes a new public Python module,
> `pyflink.dataframe`, as a DataFrame-style API on top of the existing
> PyFlink Table API. The goal is not to introduce a new execution model,
> but to provide a more natural Python-facing entry point for users
> coming from the broader Python data ecosystem, while preserving Flink
> semantics and execution capabilities.
>
> The proposal focuses on:
>     - Designing a Python-friendly DataFrame API for PyFlink, including
> the API shape itself, a more user-friendly DataType design, unified
> configuration, reduced TableEnvironment boilerplate, and a practical
> multiple-sink model for end-to-end pipelines
>     - Providing ergonomic support for row-oriented Python
> transformations, including map / map_batches style operations for
> enrichment, feature engineering, and AI/ML workloads
>     - Exposing concurrency configuration so that expensive Python
> stages can be scaled independently, making it easier to build
> practical jobs directly with the DataFrame API
>     - Supporting Arrow as a first-class batch format for efficient
> interoperability with the Python ecosystem
>
> The Design Decisions section discusses the main design considerations
> behind the proposal and may be a useful place to pay extra attention
> when reviewing it.
>
> Looking forward to your feedback!
>
> Regards,
> Dian
>
> [1]
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-591*3A*Introducing*Python*DataFrame*API*in*PyFlink__;JSsrKysrKw!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSizWPE4Qo$
> [2]
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275__;!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSiab5Dp28$
>

Re: [DISCUSS] FLIP-591: Introducing Python DataFrame API in PyFlink

Reply via email to