Re: [DISCUSS] FLIP-591: Introducing Python DataFrame API in PyFlink

Dian Fu Thu, 18 Jun 2026 05:52:36 -0700

Hi Cole,

Thanks for the feedback. Both suggestions make sense for me, and I've
updated the FLIP.


1. kwargs aliasing in .agg: see section "6. DataFrame — Aggregation"
2. Attribute-style column references: see section "15. DataFrame —
Column Access".

Thanks again!

Regards,
Dian

On Thu, Jun 18, 2026 at 11:06 AM Yong Fang <[email protected]> wrote:
>
> Thanks Fu Dian for initiating this discussion and +1 for the total design.
>
> I have several comments for this flip:
>
> 1. In multi-modal data processing scenarios, the operations are mostly
> reading and writing files, images, audio and video. From the current API
> perspective, these operations can only be added manually in read_custom and
> write_custom, how are the read/write APIs for these types designed?
>
> 2. Currently, batch_size is controlled by row count in map_batches.
> However, the per-row size of multimodal data varies dramatically — a single
> 4K image can be up to 20 MB, while a piece of text may only be 100 B.
> Splitting batches purely by row count may lead to OOM. Should we support
> splitting batches by memory budget too?
>
> 3. GPU computing is a common requirement in multimodal processing, I don't
> seem to see any related information in this set of APIs such as
> map/map_batches and ect. How can we set GPU/CPU resources and
> specifications for them and udfs?
>
> 4. In addition, users may need to load local models within map/map_batches
> for data processing. The current APIs only support the callback format.
> Should class types also be supported? This way, I/O and other operations
> only need to be executed once for a class instance.
>
> Best,
> FangYong
>
> On Thu, Jun 18, 2026 at 7:13 AM Zander Matheson <[email protected]>
> wrote:
>
> > Very nice proposal, Dian Fu, thank you for putting this together!
> >
> > It feels like it supersedes FLIP-541
> > <
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=378473217
> > >
> > since
> > it will solve many of the problems we initially discussed related to making
> > Flink more Pythonic.
> >
> > I had some questions and comments on the flip shared below. Overall I
> > really like the design you have proposed and look forward to collaborating
> > hopefully!
> >
> > *Pandas/PySpark/Polars API Parity*
> > What are your thoughts on parity with some of the established APIs. I see
> > it is mentioned in the FLIP to both lower the barrier to entry for Python
> > users who are familiar with Dataframe APIs, while also having a non-goal of
> > providing full compatibility with Pandas, Polars etc. Below are a couple of
> > dataframe api common that we could potentially more closely align to.
> >
> >    1. dropna and fillna instead of drop_null/fill_null. Python does not
> >    have a Null concept, so it might make sense to keep the drop_na used in
> >    other dataframe libraries?
> >    2. show(), .display(). How to peek at the data? This is common patter in
> >    development
> >    3. sort(by, descending=False). This boolean flag is opposite what is
> >    used in other libraries.
> >    4. map() in Dataframe land is an element-wise computation, do we want to
> >    break with that? I believe apply() is more similar. That being said,
> > map as
> >    intended from map reduce is more akin to the row-wise model shown. More
> > a
> >    question than anything else.
> >    5. Index - I didn’t see any mention to indices in the document, is this
> >    something we are explicitly avoiding? I have often use indices for
> >    selecting groups of rows or columns for manipulation.
> >    6. In_place - this may not be possible with the execution style, but is
> >    there a possibility of having in_place booleans for things like unique
> > or
> >    sort where the output can either be done in place or the output needs
> > to be
> >    assigned to a new dataframe.
> >    7. to_json/from_json and .values().
> >    8. Properties - Do we want to add some of the niceties of spark/pandas
> >    here if possible? df.shape, df.info, df.describe
> >
> >
> > *Execution Model*I like the idea of having triggers for the execution on
> > write and materialization (to_pandas etc.). This is probably the meat of
> > the problem for creating a dataframe API that feels like the other commonly
> > used dataframe APIs. The balance between write and providing the
> > statement_set to make the writes coordinated seems nice. But is there also
> > potentially a world where we want to expose the ability to arbitrarily
> > trigger an execution? Say for Exploratory Data Analysis in a notebook? This
> > seems possible with to_pandas and to_list etc., but maybe having a lower
> > level primitive that could be called could be a good escape hatch?
> >
> >
> > *Exploratory Data Analysis*One of the biggest use cases for dataframes is
> > exploratory data analysis. Is this something we want to encourage with this
> > FLIP? It might make sense to add some of the EDA methods for that purpose.
> > See above, show/display and execution trigger.
> >
> > *Common Use Cases*
> > Related to the EDA note above, I am curious what you envision as the most
> > common use cases for this API.
> >
> > Thanks,
> >
> > Zander
> >
> > On Wed, Jun 17, 2026 at 2:42 AM Cole Bailey via dev <[email protected]>
> > wrote:
> >
> > > Thanks Dian,
> > >
> > > There is a lot of good work here that aligns with what we have been
> > > brainstorming for a better PyFlink experience.
> > >
> > > One ergonomic suggestion I'd love to see included is supporting pythonic
> > > aliasing via kwargs within `.agg` similar to what is already outlined in
> > > `with_columns` or `select`:
> > >
> > > The example would instead look like this:
> > >
> > > df.group_by("dept").agg(
> > >     avg_salary=col("salary").avg(),
> > >     headcount=col("id").count(),
> > > )
> > >
> > >
> > > Another nice-to-have would be flexibility in column referencing, I see 2
> > > variations scattered throughout the FLIP:
> > >
> > > col("age")
> > >
> > > df["age"]
> > >
> > >
> > > Both of these make sense, I think we should also consider supporting attr
> > > style column references since these can be reused across the lambda or
> > > subscript filtering examples already in the FLIP:
> > >
> > > df.age
> > >
> > >
> > > That would then give us this representative example:
> > >
> > > df.group_by("dept").agg(
> > >     avg_salary=df.salary.avg(),
> > >     headcount=df.id.count(),
> > > )
> > >
> > >
> > > Cheers,
> > > Cole
> > >
> > >
> > >
> > > On Wed, Jun 17, 2026 at 6:12 AM Dian Fu <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to start a discussion about FLIP-591: Introducing Python
> > > > DataFrame API in PyFlink [1].
> > > >
> > > > This FLIP is to a sub-FLIP of the broader direction discussed in
> > > > FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal Data
> > > > Processing) [2]. This FLIP proposes a new public Python module,
> > > > `pyflink.dataframe`, as a DataFrame-style API on top of the existing
> > > > PyFlink Table API. The goal is not to introduce a new execution model,
> > > > but to provide a more natural Python-facing entry point for users
> > > > coming from the broader Python data ecosystem, while preserving Flink
> > > > semantics and execution capabilities.
> > > >
> > > > The proposal focuses on:
> > > >     - Designing a Python-friendly DataFrame API for PyFlink, including
> > > > the API shape itself, a more user-friendly DataType design, unified
> > > > configuration, reduced TableEnvironment boilerplate, and a practical
> > > > multiple-sink model for end-to-end pipelines
> > > >     - Providing ergonomic support for row-oriented Python
> > > > transformations, including map / map_batches style operations for
> > > > enrichment, feature engineering, and AI/ML workloads
> > > >     - Exposing concurrency configuration so that expensive Python
> > > > stages can be scaled independently, making it easier to build
> > > > practical jobs directly with the DataFrame API
> > > >     - Supporting Arrow as a first-class batch format for efficient
> > > > interoperability with the Python ecosystem
> > > >
> > > > The Design Decisions section discusses the main design considerations
> > > > behind the proposal and may be a useful place to pay extra attention
> > > > when reviewing it.
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > Regards,
> > > > Dian
> > > >
> > > > [1]
> > > >
> > >
> > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-591*3A*Introducing*Python*DataFrame*API*in*PyFlink__;JSsrKysrKw!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSizWPE4Qo$
> > > > [2]
> > > >
> > >
> > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275__;!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSiab5Dp28$
> > > >
> > >
> >

Re: [DISCUSS] FLIP-591: Introducing Python DataFrame API in PyFlink

Reply via email to