Re: [DISCUSS] FLIP-591: Introducing Python DataFrame API in PyFlink

Dian Fu Sat, 27 Jun 2026 08:05:43 -0700

Hi Zander,

Thanks for the suggestions. The overall idea makes a lot of sense to
me. I did some investigation based on your suggestions and slightly
adjusted the API shape while keeping the same intent. Could you take a
look and let me know whether the following API makes sense to you?


```
def iter_rows(
    *,
    include_row_kind: bool = False,
    row_kind_field: str = "__row_kind__",
) -> CloseableIterator[Dict[str, Any]]

def iter_batches(
    *,
    batch_size: int = 1000,
    batch_format: Literal["pandas", "pyarrow"] = "pandas",
    include_row_kind: bool = False,
    row_kind_field: str = "__row_kind__",
) -> CloseableIterator[pandas.DataFrame | pyarrow.Table]

def take(
    n: int,
    *,
    timeout: int | None = None,
    include_row_kind: bool = False,
    row_kind_field: str = "__row_kind__",
) -> List[Dict[str, Any]]
```

That's, introducing `iter_rows()` and `iter_batches()` as the explicit
incremental result-consuming APIs, and `take()` as the bounded preview
API.

This is also aligned with APIs in other systems. Ray Dataset provides
`iter_rows()`, `iter_batches()`, and `take()`. Daft provides
`iter_rows()` for row-wise iteration and `iter_partitions()` for
partition iteration. Polars has `iter_rows()` on eager DataFrames and
`collect_batches()` for batched iteration. Spark provides similar
result-consuming APIs such as `toLocalIterator()` and `take()`,
although the names are different.

For bounded preview, I would prefer `take(n, timeout=...)` over
`fetch(n, timeout=...)`. `take(n)` is more commonly used in
distributed DataFrame/Dataset APIs, such as Spark and Ray, to mean
returning up to the first `n` rows to the client.

Regards,
Dian

On Thu, Jun 25, 2026 at 8:40 AM Zander Matheson <[email protected]> wrote:
>
> Hi Dian,
>
> Thanks for the thorough responses, this all looks great! Agreed on the
> resolved points: the drop_null/fill_null split with the Polars precedent,
> the descending= alignment with Polars/Daft/Ray, map/map_batches following
> Ray Data, no implicit index, and immutable lazy plans (no in_place). Those
> are all convincing, and totally make sense. Thanks for adding
> show/__repr__/_repr_html_/_repr_mimebundle_ and describe()!
>
> On the execution trigger, here's what I have in mind. A closeable iterator
> so you can build more finite control. I believe that Flink already has the
> machinery: TableResult.collect() returns a closeable iterator, and closing
> it cancels the job. I'd propose surfacing that at the DataFrame level as
> the documented low-level escape hatch -- roughly:
>
>     # low-level handle: submits the job, returns lazily, does NOT drain to
> completion
>     handle = df.execute()
>
>     # pull rows incrementally as they arrive
>     with handle as rows:
>         for row in itertools.islice(rows, 20):
>             ...
>     # leaving the context (or handle.close()) cancels the job
>
>     # or a bounded, time-boxed fetch that always terminates on a live
> stream:
>     preview = df.fetch(n=20, timeout="10s")   # take up to n rows or until
> deadline, then cancel
>
> The goals are essentially:
>
> - Bounded + time-boxed, so it terminates on an unbounded source instead of
> hanging.
> - Close-cancels, so a notebook peek doesn't leave a job running.
> - Incremental, so tooling can render rows as they arrive (the basis for an
> Ibis-style interactive/_repr_html_ mode).
> - Changelog-aware, so the iterator yields rows with their RowKind, so a
> helper can either show the raw +I/-U/+U/-D stream or maintain a
> materialized snapshot for updating queries (the same distinction the SQL
> client's table vs changelog result modes already make).
>
> The implementation can be ironed out later if the above pseudo-code is not
> on track.
>
> Fully agree the primary use case is AI/ML enrichment for FLIP-577. My
> interest in the low-level trigger is mostly that the build-inspect-iterate
> loop is how people will develop those enrichment pipelines, so a solid
> interactive primitive pays off there too.
>
> Thanks again -- really looking forward to collaborating on this.
>
> Best,
> Zander
>
> On Thu, Jun 18, 2026 at 7:27 AM Dian Fu <[email protected]> wrote:
>
> > Hi Zander,
> >
> > Thanks for the feedback!
> >
> > Replies inline.
> >
> > Relationship to FLIP-541: Good point. FLIP-541 took the approach of
> > evolving the Table API itself toward a more DataFrame-like style. In
> > practice, though, I found that it turned out to be hard to push
> > forward: the Table API already has a large API surface, and for
> > compatibility reasons the DataFrame-style additions would have to
> > coexist with the existing Table-style APIs. Mixing the two styles in
> > one API tends to make things more confusing for users rather than
> > less. So I think a more viable direction may be to keep the Table API
> > positioned as the API aligned with Flink's existing relational/Table
> > model and introduce a separate, dedicated DataFrame API alongside it.
> > That keeps each API internally consistent — Table API for the
> > Flink-native relational model, DataFrame API for the Python/DataFrame
> > mental model — instead of blending both into one surface.
> >
> > API parity: My overall take is that we should adopt a DataFrame-style
> > API design rather than 100% following Pandas DataFrame API. This is
> > also the direction taken by the newer generation of Python data
> > projects such as Polars, Daft, and Ray Data: they share the familiar
> > DataFrame mental model and ergonomics, but each adapts the API to its
> > own execution engine and semantics rather than mirroring pandas
> > exactly. We follow the same philosophy here — borrow the
> > widely-adopted ergonomics, but stay consistent with Flink's
> > streaming/distributed semantics.
> >
> > On the specific points:
> > 1. dropna/fillna vs drop_null/fill_null: I'd lean toward keeping
> > drop_null/fill_null. Flink's type system has a real NULL (and NaN is a
> > distinct float value) and actually Python has None, so the FLIP
> > separates them: drop_null/fill_null for NULL and drop_nan/fill_nan for
> > NaN. The pandas dropna name conflates the two. Polars made the same
> > NULL/NaN split with drop_nulls/fill_null + fill_nan, so this also has
> > precedent.
> > 2. show() / display(): Good point. Peeking at data is essential. Have
> > added show, __repr__, _repr_html_ and _repr_mimebundle_ in the section
> > "DataFrame — Conversion & Execution".
> > 3. sort(by, descending=False): I kept descending= rather than
> > ascending=. The default is ascending in all cases, matching
> > pandas/PySpark/Polars/Daft behavior — the only difference is the flag
> > name, and the flag name descending aligns with Polars/Daft/Ray. While
> > Pyspark/Pandas takes ascending as the flag name. For me, I slightly
> > tend to align with Polars/Daft/Ray since our motivation is multimodal
> > data processing and our users will be familiar with these projects.
> > However, I'm open to this.
> > 4. map() element-wise vs row-wise: You're right that pandas
> > DataFrame.map is element-wise (it's the renamed applymap), while ours
> > is row-wise. We intentionally follow the Ray Data model here (map =
> > per-row, map_batches = vectorized), which fits the enrichment/ML use
> > cases. The name `apply` isn't paired with map_batches and so I tend to
> > use `map` here.
> > 5. Index: Yes, this is deliberate — the API has no implicit
> > pandas-style index. Flink is a distributed/streaming engine with no
> > global row order, so a positional index isn't well-defined. Row/column
> > selection is done by column reference and boolean filtering instead.
> > 6. in_place: DataFrames here are lazy logical plans, not mutable
> > in-memory buffers, so every operation returns a new DataFrame —
> > there's nothing to mutate in place.
> > 7. to_json/from_json and .values(): JSON is supported for
> > sources/sinks via read_json/write_json. Is that what you want?
> > 8. Properties (df.shape/df.info/df.describe): Have added `describe()`
> > in section "13. DataFrame — Conversion & Execution". Regarding
> > `df.shape and df.info`: I left out for now since it will force data
> > processing for `df.shape` which seems not that useful and schema
> > inspection is already available via df.schema.
> >
> > Execution model / low-level trigger: I like the escape-hatch idea.
> > Today an execution can be triggered via
> > to_pandas()/to_list()/collect()/show(). Could you elaborate on what
> > specific API you have in mind?
> >
> > EDA / common use cases: The primary use cases I envision are ML/AI
> > enrichment (this is to support FLIP-577: AI-Native Flink — An Umbrella
> > Proposal for Multimodal Data Processing). EDA is supported
> > (show/describe/to_pandas) but I'd keep the heavier EDA helpers minimal
> > and grow them based on demand.
> >
> > Thanks again and looking forward to collaborating!
> >
> > Regards,
> > Dian
> >
> > On Thu, Jun 18, 2026 at 8:50 PM Dian Fu <[email protected]> wrote:
> > >
> > > Hi Cole,
> > >
> > > Thanks for the feedback. Both suggestions make sense for me, and I've
> > > updated the FLIP.
> > >
> > > 1. kwargs aliasing in .agg: see section "6. DataFrame — Aggregation"
> > > 2. Attribute-style column references: see section "15. DataFrame —
> > > Column Access".
> > >
> > > Thanks again!
> > >
> > > Regards,
> > > Dian
> > >
> > > On Thu, Jun 18, 2026 at 11:06 AM Yong Fang <[email protected]> wrote:
> > > >
> > > > Thanks Fu Dian for initiating this discussion and +1 for the total
> > design.
> > > >
> > > > I have several comments for this flip:
> > > >
> > > > 1. In multi-modal data processing scenarios, the operations are mostly
> > > > reading and writing files, images, audio and video. From the current
> > API
> > > > perspective, these operations can only be added manually in
> > read_custom and
> > > > write_custom, how are the read/write APIs for these types designed?
> > > >
> > > > 2. Currently, batch_size is controlled by row count in map_batches.
> > > > However, the per-row size of multimodal data varies dramatically — a
> > single
> > > > 4K image can be up to 20 MB, while a piece of text may only be 100 B.
> > > > Splitting batches purely by row count may lead to OOM. Should we
> > support
> > > > splitting batches by memory budget too?
> > > >
> > > > 3. GPU computing is a common requirement in multimodal processing, I
> > don't
> > > > seem to see any related information in this set of APIs such as
> > > > map/map_batches and ect. How can we set GPU/CPU resources and
> > > > specifications for them and udfs?
> > > >
> > > > 4. In addition, users may need to load local models within
> > map/map_batches
> > > > for data processing. The current APIs only support the callback format.
> > > > Should class types also be supported? This way, I/O and other
> > operations
> > > > only need to be executed once for a class instance.
> > > >
> > > > Best,
> > > > FangYong
> > > >
> > > > On Thu, Jun 18, 2026 at 7:13 AM Zander Matheson <
> > [email protected]>
> > > > wrote:
> > > >
> > > > > Very nice proposal, Dian Fu, thank you for putting this together!
> > > > >
> > > > > It feels like it supersedes FLIP-541
> > > > > <
> > > > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=378473217
> > > > > >
> > > > > since
> > > > > it will solve many of the problems we initially discussed related to
> > making
> > > > > Flink more Pythonic.
> > > > >
> > > > > I had some questions and comments on the flip shared below. Overall I
> > > > > really like the design you have proposed and look forward to
> > collaborating
> > > > > hopefully!
> > > > >
> > > > > *Pandas/PySpark/Polars API Parity*
> > > > > What are your thoughts on parity with some of the established APIs.
> > I see
> > > > > it is mentioned in the FLIP to both lower the barrier to entry for
> > Python
> > > > > users who are familiar with Dataframe APIs, while also having a
> > non-goal of
> > > > > providing full compatibility with Pandas, Polars etc. Below are a
> > couple of
> > > > > dataframe api common that we could potentially more closely align to.
> > > > >
> > > > >    1. dropna and fillna instead of drop_null/fill_null. Python does
> > not
> > > > >    have a Null concept, so it might make sense to keep the drop_na
> > used in
> > > > >    other dataframe libraries?
> > > > >    2. show(), .display(). How to peek at the data? This is common
> > patter in
> > > > >    development
> > > > >    3. sort(by, descending=False). This boolean flag is opposite what
> > is
> > > > >    used in other libraries.
> > > > >    4. map() in Dataframe land is an element-wise computation, do we
> > want to
> > > > >    break with that? I believe apply() is more similar. That being
> > said,
> > > > > map as
> > > > >    intended from map reduce is more akin to the row-wise model
> > shown. More
> > > > > a
> > > > >    question than anything else.
> > > > >    5. Index - I didn’t see any mention to indices in the document,
> > is this
> > > > >    something we are explicitly avoiding? I have often use indices for
> > > > >    selecting groups of rows or columns for manipulation.
> > > > >    6. In_place - this may not be possible with the execution style,
> > but is
> > > > >    there a possibility of having in_place booleans for things like
> > unique
> > > > > or
> > > > >    sort where the output can either be done in place or the output
> > needs
> > > > > to be
> > > > >    assigned to a new dataframe.
> > > > >    7. to_json/from_json and .values().
> > > > >    8. Properties - Do we want to add some of the niceties of
> > spark/pandas
> > > > >    here if possible? df.shape, df.info, df.describe
> > > > >
> > > > >
> > > > > *Execution Model*I like the idea of having triggers for the
> > execution on
> > > > > write and materialization (to_pandas etc.). This is probably the
> > meat of
> > > > > the problem for creating a dataframe API that feels like the other
> > commonly
> > > > > used dataframe APIs. The balance between write and providing the
> > > > > statement_set to make the writes coordinated seems nice. But is
> > there also
> > > > > potentially a world where we want to expose the ability to
> > arbitrarily
> > > > > trigger an execution? Say for Exploratory Data Analysis in a
> > notebook? This
> > > > > seems possible with to_pandas and to_list etc., but maybe having a
> > lower
> > > > > level primitive that could be called could be a good escape hatch?
> > > > >
> > > > >
> > > > > *Exploratory Data Analysis*One of the biggest use cases for
> > dataframes is
> > > > > exploratory data analysis. Is this something we want to encourage
> > with this
> > > > > FLIP? It might make sense to add some of the EDA methods for that
> > purpose.
> > > > > See above, show/display and execution trigger.
> > > > >
> > > > > *Common Use Cases*
> > > > > Related to the EDA note above, I am curious what you envision as the
> > most
> > > > > common use cases for this API.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Zander
> > > > >
> > > > > On Wed, Jun 17, 2026 at 2:42 AM Cole Bailey via dev <
> > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Thanks Dian,
> > > > > >
> > > > > > There is a lot of good work here that aligns with what we have been
> > > > > > brainstorming for a better PyFlink experience.
> > > > > >
> > > > > > One ergonomic suggestion I'd love to see included is supporting
> > pythonic
> > > > > > aliasing via kwargs within `.agg` similar to what is already
> > outlined in
> > > > > > `with_columns` or `select`:
> > > > > >
> > > > > > The example would instead look like this:
> > > > > >
> > > > > > df.group_by("dept").agg(
> > > > > >     avg_salary=col("salary").avg(),
> > > > > >     headcount=col("id").count(),
> > > > > > )
> > > > > >
> > > > > >
> > > > > > Another nice-to-have would be flexibility in column referencing, I
> > see 2
> > > > > > variations scattered throughout the FLIP:
> > > > > >
> > > > > > col("age")
> > > > > >
> > > > > > df["age"]
> > > > > >
> > > > > >
> > > > > > Both of these make sense, I think we should also consider
> > supporting attr
> > > > > > style column references since these can be reused across the
> > lambda or
> > > > > > subscript filtering examples already in the FLIP:
> > > > > >
> > > > > > df.age
> > > > > >
> > > > > >
> > > > > > That would then give us this representative example:
> > > > > >
> > > > > > df.group_by("dept").agg(
> > > > > >     avg_salary=df.salary.avg(),
> > > > > >     headcount=df.id.count(),
> > > > > > )
> > > > > >
> > > > > >
> > > > > > Cheers,
> > > > > > Cole
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 17, 2026 at 6:12 AM Dian Fu <[email protected]>
> > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I would like to start a discussion about FLIP-591: Introducing
> > Python
> > > > > > > DataFrame API in PyFlink [1].
> > > > > > >
> > > > > > > This FLIP is to a sub-FLIP of the broader direction discussed in
> > > > > > > FLIP-577 (AI-Native Flink — An Umbrella Proposal for Multimodal
> > Data
> > > > > > > Processing) [2]. This FLIP proposes a new public Python module,
> > > > > > > `pyflink.dataframe`, as a DataFrame-style API on top of the
> > existing
> > > > > > > PyFlink Table API. The goal is not to introduce a new execution
> > model,
> > > > > > > but to provide a more natural Python-facing entry point for users
> > > > > > > coming from the broader Python data ecosystem, while preserving
> > Flink
> > > > > > > semantics and execution capabilities.
> > > > > > >
> > > > > > > The proposal focuses on:
> > > > > > >     - Designing a Python-friendly DataFrame API for PyFlink,
> > including
> > > > > > > the API shape itself, a more user-friendly DataType design,
> > unified
> > > > > > > configuration, reduced TableEnvironment boilerplate, and a
> > practical
> > > > > > > multiple-sink model for end-to-end pipelines
> > > > > > >     - Providing ergonomic support for row-oriented Python
> > > > > > > transformations, including map / map_batches style operations for
> > > > > > > enrichment, feature engineering, and AI/ML workloads
> > > > > > >     - Exposing concurrency configuration so that expensive Python
> > > > > > > stages can be scaled independently, making it easier to build
> > > > > > > practical jobs directly with the DataFrame API
> > > > > > >     - Supporting Arrow as a first-class batch format for
> > efficient
> > > > > > > interoperability with the Python ecosystem
> > > > > > >
> > > > > > > The Design Decisions section discusses the main design
> > considerations
> > > > > > > behind the proposal and may be a useful place to pay extra
> > attention
> > > > > > > when reviewing it.
> > > > > > >
> > > > > > > Looking forward to your feedback!
> > > > > > >
> > > > > > > Regards,
> > > > > > > Dian
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > >
> > > > >
> > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-591*3A*Introducing*Python*DataFrame*API*in*PyFlink__;JSsrKysrKw!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSizWPE4Qo$
> > > > > > > [2]
> > > > > > >
> > > > > >
> > > > >
> > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275__;!!Ayb5sqE7!t1Rd2wTS8LFWmjw7srNbCUz4lZ5NXo__BnGTzGFeJ5BeO4T4tOCZ1hCNysc10NKuLUuegGThcr5ksMSiab5Dp28$
> > > > > > >
> > > > > >
> > > > >
> >

Re: [DISCUSS] FLIP-591: Introducing Python DataFrame API in PyFlink

Reply via email to