Hi Cole, Thanks for the feedback. Both suggestions make sense for me, and I've updated the FLIP.
1. kwargs aliasing in .agg: see section "6. DataFrame — Aggregation" 2. Attribute-style column references: see section "15. DataFrame — Column Access". Thanks again! Regards, Dian On Thu, Jun 18, 2026 at 11:06 AM Yong Fang <[email protected]> wrote: > > Thanks Fu Dian for initiating this discussion and +1 for the total design. > > I have several comments for this flip: > > 1. In multi-modal data processing scenarios, the operations are mostly > reading and writing files, images, audio and video. From the current API > perspective, these operations can only be added manually in read_custom and > write_custom, how are the read/write APIs for these types designed? > > 2. Currently, batch_size is controlled by row count in map_batches. > However, the per-row size of multimodal data varies dramatically — a single > 4K image can be up to 20 MB, while a piece of text may only be 100 B. > Splitting batches purely by row count may lead to OOM. Should we support > splitting batches by memory budget too? > > 3. GPU computing is a common requirement in multimodal processing, I don't > seem to see any related information in this set of APIs such as > map/map_batches and ect. How can we set GPU/CPU resources and > specifications for them and udfs? > > 4. In addition, users may need to load local models within map/map_batches > for data processing. The current APIs only support the callback format. > Should class types also be supported? This way, I/O and other operations > only need to be executed once for a class instance. > > Best, > FangYong > > On Thu, Jun 18, 2026 at 7:13 AM Zander Matheson <[email protected]> > wrote: > > > Very nice proposal, Dian Fu, thank you for putting this together! > > > > It feels like it supersedes FLIP-541 > > < > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=378473217 > > > > > since > > it will solve many of the problems we initially discussed related to making > > Flink more Pythonic. > > > > I had some questions and comments on the flip shared below. Overall I > > really like the design you have proposed and look forward to collaborating > > hopefully! > > > > *Pandas/PySpark/Polars API Parity* > > What are your thoughts on parity with some of the established APIs. I see > > it is mentioned in the FLIP to both lower the barrier to entry for Python > > users who are familiar with Dataframe APIs, while also having a non-goal of > > providing full compatibility with Pandas, Polars etc. Below are a couple of > > dataframe api common that we could potentially more closely align to. > > > > 1. dropna and fillna instead of drop_null/fill_null. Python does not > > have a Null concept, so it might make sense to keep the drop_na used in > > other dataframe libraries? > > 2. show(), .display(). How to peek at the data? This is common patter in > > development > > 3. sort(by, descending=False). This boolean flag is opposite what is > > used in other libraries. > > 4. map() in Dataframe land is an element-wise computation, do we want to > > break with that? I believe apply() is more similar. That being said, > > map as > > intended from map reduce is more akin to the row-wise model shown. More > > a > > question than anything else. > > 5. Index - I didn’t see any mention to indices in the document, is this > > something we are explicitly avoiding? I have often use indices for > > selecting groups of rows or columns for manipulation. > > 6. In_place - this may not be possible with the execution style, but is > > there a possibility of having in_place booleans for things like unique > > or > > sort where the output can either be done in place or the output needs > > to be > > assigned to a new dataframe. > > 7. to_json/from_json and .values(). > > 8. Properties - Do we want to add some of the niceties of spark/pandas > > here if possible? df.shape, df.info, df.describe > > > > > > *Execution Model*I like the idea of having triggers for the execution on > > write and materialization (to_pandas etc.). This is probably the meat of > > the problem for creating a dataframe API that feels like the other commonly > > used dataframe APIs. The balance between write and providing the > > statement_set to make the writes coordinated seems nice. But is there also > > potentially a world where we want to expose the ability to arbitrarily > > trigger an execution? Say for Exploratory Data Analysis in a notebook? This > > seems possible with to_pandas and to_list etc., but maybe having a lower > > level primitive that could be called could be a good escape hatch? > > > > > > *Exploratory Data Analysis*One of the biggest use cases for dataframes is > > exploratory data analysis. Is this something we want to encourage with this > > FLIP? It might make sense to add some of the EDA methods for that purpose. > > See above, show/display and execution trigger. > > > > *Common Use Cases* > > Related to the EDA note above, I am curious what you envision as the most > > common use cases for this API. > > > > Thanks, > > > > Zander > > > > On Wed, Jun 17, 2026 at 2:42 AM Cole Bailey via dev <[email protected]> > > wrote: > > > > > Thanks Dian, > > > > > > There is a lot of good work here that aligns with what we have been > > > brainstorming for a better PyFlink experience. > > > > > > One ergonomic suggestion I'd love to see included is supporting pythonic > > > aliasing via kwargs within `.agg` similar to what is already outlined in > > > `with_columns` or `select`: > > > > > > The example would instead look like this: > > > > > > df.group_by("dept").agg( > > > avg_salary=col("salary").avg(), > > > headcount=col("id").count(), > > > ) > > > > > > > > > Another nice-to-have would be flexibility in column referencing, I see 2 > > > variations scattered throughout the FLIP: > > > > > > col("age") > > > > > > df["age"] > > > > > > > > > Both of these make sense, I think we should also consider supporting attr > > > style column references since these can be reused across the lambda or > > > subscript filtering examples already in the FLIP: > > > > > > df.age > > > > > > > > > That would then give us this representative example: > > > > > > df.group_by("dept").agg( > > > avg_salary=df.salary.avg(), > > > headcount=df.id.count(), > > > ) > > > > > > > > > Cheers, > > > Cole > > > > > > > > > > > > On Wed, Jun 17, 2026 at 6:12 AM Dian Fu <[email protected]> wrote: > > > > > > > Hi all, > > > > > > > > I would like to start a discussion about FLIP-591: Introducing Python > > > > DataFrame API in PyFlink [1]. > > > > > > > > This FLIP is to a sub-FLIP of the broader direction discussed in > > > > FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal Data > > > > Processing) [2]. This FLIP proposes a new public Python module, > > > > `pyflink.dataframe`, as a DataFrame-style API on top of the existing > > > > PyFlink Table API. The goal is not to introduce a new execution model, > > > > but to provide a more natural Python-facing entry point for users > > > > coming from the broader Python data ecosystem, while preserving Flink > > > > semantics and execution capabilities. > > > > > > > > The proposal focuses on: > > > > - Designing a Python-friendly DataFrame API for PyFlink, including > > > > the API shape itself, a more user-friendly DataType design, unified > > > > configuration, reduced TableEnvironment boilerplate, and a practical > > > > multiple-sink model for end-to-end pipelines > > > > - Providing ergonomic support for row-oriented Python > > > > transformations, including map / map_batches style operations for > > > > enrichment, feature engineering, and AI/ML workloads > > > > - Exposing concurrency configuration so that expensive Python > > > > stages can be scaled independently, making it easier to build > > > > practical jobs directly with the DataFrame API > > > > - Supporting Arrow as a first-class batch format for efficient > > > > interoperability with the Python ecosystem > > > > > > > > The Design Decisions section discusses the main design considerations > > > > behind the proposal and may be a useful place to pay extra attention > > > > when reviewing it. > > > > > > > > Looking forward to your feedback! > > > > > > > > Regards, > > > > Dian > > > > > > > > [1] > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-591*3A*Introducing*Python*DataFrame*API*in*PyFlink__;JSsrKysrKw!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSizWPE4Qo$ > > > > [2] > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275__;!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSiab5Dp28$ > > > > > > > > >
