Re: [DISCUSS] FLIP-591: Introducing Python DataFrame API in PyFlink

Yong Fang Wed, 17 Jun 2026 20:07:37 -0700

Thanks Fu Dian for initiating this discussion, and +1 for the overall design.

I have several comments on this FLIP:


1. In multi-modal data processing scenarios, most operations involve
reading and writing files, images, audio and video. With the current API,
these can only be added manually via read_custom and write_custom. How
will the read/write APIs for these types be designed?

2. Currently, batch_size is controlled by row count in map_batches.
However, the per-row size of multimodal data varies dramatically — a single
4K image can be up to 20 MB, while a piece of text may only be 100 B.
Splitting batches purely by row count may lead to OOM. Should we support
splitting batches by memory budget too?

3. GPU computing is a common requirement in multimodal processing, but I
don't see anything related to it in this set of APIs, such as
map/map_batches. How can we set GPU/CPU resources and specifications for
them and for UDFs?

4. In addition, users may need to load local models within map/map_batches
for data processing. The current APIs only support the callback format.
Should class types also be supported? This way, I/O and other operations
only need to be executed once for a class instance.
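
To make points 2 and 4 concrete, here is a rough sketch in plain Python. It
is not PyFlink API and not what the FLIP proposes: batches_by_budget,
max_bytes and SizeTagger are hypothetical names I made up for illustration,
and the "model" is just a threshold standing in for an expensive load.

```python
from typing import Iterable, Iterator, List


def batches_by_budget(
    rows: Iterable[bytes],
    max_rows: int,
    max_bytes: int,
) -> Iterator[List[bytes]]:
    """Yield batches that respect both a row-count cap and a byte budget."""
    batch: List[bytes] = []
    batch_bytes = 0
    for row in rows:
        size = len(row)
        # Flush before adding a row that would exceed either limit.
        if batch and (len(batch) >= max_rows or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        # A single row larger than the budget still forms its own batch.
        batch.append(row)
        batch_bytes += size
    if batch:
        yield batch


class SizeTagger:
    """Hypothetical class-style UDF: expensive setup runs once per instance."""

    load_count = 0  # counts how often the (fake) model is loaded

    def __init__(self) -> None:
        # Stand-in for loading a local model from disk.
        SizeTagger.load_count += 1
        self.threshold = 1024

    def __call__(self, batch: List[bytes]) -> List[str]:
        return ["large" if len(row) > self.threshold else "small" for row in batch]


# Three small rows and one oversized row (assumed sizes, for illustration).
rows = [b"x" * 100, b"x" * 100, b"x" * 5000, b"x" * 100]
udf = SizeTagger()
batches = list(batches_by_budget(rows, max_rows=3, max_bytes=4096))
results = [udf(batch) for batch in batches]
```

With a row-count limit alone, the 5000-byte row would share a batch with its
neighbours. With the byte budget it is isolated in its own batch, and the
setup in __init__ runs once no matter how many batches are processed.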

Best,
FangYong

On Thu, Jun 18, 2026 at 7:13 AM Zander Matheson <[email protected]>
wrote:

> Very nice proposal, Dian Fu, thank you for putting this together!
>
> It feels like it supersedes FLIP-541
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=378473217>,
> since it will solve many of the problems we initially discussed related to
> making Flink more Pythonic.
>
> I had some questions and comments on the FLIP, shared below. Overall I
> really like the design you have proposed, and I hope we can collaborate
> on it!
>
> *Pandas/PySpark/Polars API Parity*
> What are your thoughts on parity with some of the established APIs? The
> FLIP mentions both lowering the barrier to entry for Python users who are
> familiar with DataFrame APIs and, as a non-goal, providing full
> compatibility with Pandas, Polars, etc. Below are a few common DataFrame
> API conventions that we could potentially align with more closely.
>
>    1. dropna and fillna instead of drop_null/fill_null. Python does not
>    have a Null concept, so it might make sense to keep the dropna naming
>    used in other dataframe libraries?
>    2. show(), .display(). How to peek at the data? This is a common
>    pattern in development.
>    3. sort(by, descending=False). This boolean flag is opposite what is
>    used in other libraries.
>    4. map() in DataFrame land is an element-wise computation; do we want
>    to break with that? I believe apply() is more similar. That being said,
>    map in the MapReduce sense is more akin to the row-wise model shown.
>    More a question than anything else.
>    5. Index - I didn’t see any mention of indices in the document. Is this
>    something we are explicitly avoiding? I have often used indices for
>    selecting groups of rows or columns for manipulation.
>    6. In_place - this may not be possible with the execution style, but is
>    there a possibility of having in_place booleans for things like unique
>    or sort, so that the operation is either done in place or its output
>    is assigned to a new dataframe?
>    7. to_json/from_json and .values().
>    8. Properties - Do we want to add some of the niceties of spark/pandas
>    here if possible? df.shape, df.info, df.describe
>
>
> *Execution Model*
> I like the idea of having triggers for the execution on
> write and materialization (to_pandas etc.). This is probably the meat of
> the problem for creating a dataframe API that feels like the other commonly
> used dataframe APIs. The balance between write and providing the
> statement_set to make the writes coordinated seems nice. But is there also
> potentially a world where we want to expose the ability to arbitrarily
> trigger an execution? Say for Exploratory Data Analysis in a notebook? This
> seems possible with to_pandas and to_list etc., but maybe having a lower
> level primitive that could be called could be a good escape hatch?
>
>
> *Exploratory Data Analysis*
> One of the biggest use cases for dataframes is
> exploratory data analysis. Is this something we want to encourage with this
> FLIP? It might make sense to add some of the EDA methods for that purpose.
> See above, show/display and execution trigger.
>
> *Common Use Cases*
> Related to the EDA note above, I am curious what you envision as the most
> common use cases for this API.
>
> Thanks,
>
> Zander
>
> On Wed, Jun 17, 2026 at 2:42 AM Cole Bailey via dev <[email protected]>
> wrote:
>
> > Thanks Dian,
> >
> > There is a lot of good work here that aligns with what we have been
> > brainstorming for a better PyFlink experience.
> >
> > One ergonomic suggestion I'd love to see included is supporting pythonic
> > aliasing via kwargs within `.agg` similar to what is already outlined in
> > `with_columns` or `select`:
> >
> > The example would instead look like this:
> >
> > df.group_by("dept").agg(
> >     avg_salary=col("salary").avg(),
> >     headcount=col("id").count(),
> > )
> >
> >
> > Another nice-to-have would be flexibility in column referencing. I see 2
> > variations scattered throughout the FLIP:
> >
> > col("age")
> >
> > df["age"]
> >
> >
> > Both of these make sense, but I think we should also consider supporting
> > attribute-style column references, since these can be reused across the
> > lambda or subscript filtering examples already in the FLIP:
> >
> > df.age
> >
> >
> > That would then give us this representative example:
> >
> > df.group_by("dept").agg(
> >     avg_salary=df.salary.avg(),
> >     headcount=df.id.count(),
> > )
> >
> >
> > Cheers,
> > Cole
> >
> >
> >
> > On Wed, Jun 17, 2026 at 6:12 AM Dian Fu <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I would like to start a discussion about FLIP-591: Introducing Python
> > > DataFrame API in PyFlink [1].
> > >
> > > This FLIP is a sub-FLIP of the broader direction discussed in
> > > FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal Data
> > > Processing) [2]. This FLIP proposes a new public Python module,
> > > `pyflink.dataframe`, as a DataFrame-style API on top of the existing
> > > PyFlink Table API. The goal is not to introduce a new execution model,
> > > but to provide a more natural Python-facing entry point for users
> > > coming from the broader Python data ecosystem, while preserving Flink
> > > semantics and execution capabilities.
> > >
> > > The proposal focuses on:
> > >     - Designing a Python-friendly DataFrame API for PyFlink, including
> > > the API shape itself, a more user-friendly DataType design, unified
> > > configuration, reduced TableEnvironment boilerplate, and a practical
> > > multiple-sink model for end-to-end pipelines
> > >     - Providing ergonomic support for row-oriented Python
> > > transformations, including map / map_batches style operations for
> > > enrichment, feature engineering, and AI/ML workloads
> > >     - Exposing concurrency configuration so that expensive Python
> > > stages can be scaled independently, making it easier to build
> > > practical jobs directly with the DataFrame API
> > >     - Supporting Arrow as a first-class batch format for efficient
> > > interoperability with the Python ecosystem
> > >
> > > The Design Decisions section discusses the main design considerations
> > > behind the proposal and may be a useful place to pay extra attention
> > > when reviewing it.
> > >
> > > Looking forward to your feedback!
> > >
> > > Regards,
> > > Dian
> > >
> > > [1]
> > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-591*3A*Introducing*Python*DataFrame*API*in*PyFlink__;JSsrKysrKw!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSizWPE4Qo$
> > > [2]
> > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275__;!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSiab5Dp28$
> > >
> >
>
