Thanks, Will, for this proposal! To anyone familiar with PyArrow, this idea has a clear intuitive logic: it offers an expedient solution to the current lack of a practical means of interchanging "unmaterialized dataframes" between Python libraries.
To elaborate on that: if you look at how people use the Arrow Dataset API (which is implemented in the Arrow C++ library [1] and has bindings not only for Python [2] but also for Java [3] and R [4]), you'll see that Dataset is often used simply as a "virtual" variant of Table. It is used when the data is larger than memory, or when it is desirable to defer reading (materializing) the data into memory. So we can think of a Table as a materialized dataframe and a Dataset as an unmaterialized dataframe. That aspect of Dataset is, I think, what makes it most attractive as a protocol for enabling interoperability: it allows libraries to easily "speak Arrow" in cases where materializing the full data in memory up front is impossible or undesirable.

The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant to many of the applications Will anticipates. Some work is also needed to reconcile the methods of the PyArrow Dataset object [6] with those of the Table object. Some methods, like filter(), are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset, though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset and encouraging new scanner implementations to expose their own options controlling which lazy operations are performed.

Will, I see that you've already addressed this issue to some extent in your proposal. For example, you mention that we should initially define this protocol to include only a minimal subset of the Dataset API. I agree, but I think there are some loose ends we should be careful to tie up.
I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not part of the Arrow standard. It would be much safer for the initial version of this protocol not to define *any* methods or arguments that take expressions. This will give us time to finish the Substrait expression implementation work that is underway [7][8] and then introduce Substrait-based expressions in a later version of the protocol. This approach will also better position the protocol to be implemented in languages other than Python.

Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient. RecordBatchReader is another possible way to represent "unmaterialized dataframes," and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset. We should help developers and users understand why Arrow needs both of these.

Thanks, Will, for your thoughtful prose explanations of this proposed API. After we arrive at a decision, I think we should reproduce some of these explanations in docs, blog posts, cookbook recipes, etc., because there is nuance here that integrators of this API will need to understand.
Ian

[1] https://arrow.apache.org/docs/cpp/api/dataset.html
[2] https://arrow.apache.org/docs/python/dataset.html
[3] https://arrow.apache.org/docs/java/dataset.html
[4] https://arrow.apache.org/docs/r/articles/dataset.html
[5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
[6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
[7] https://github.com/apache/arrow/issues/33985
[8] https://github.com/apache/arrow/issues/34252
[9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html

On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> wrote:
>
> Hello Arrow devs,
>
> I have drafted a PR defining an experimental protocol which would allow
> third-party libraries to imitate the PyArrow Dataset API [5]. This protocol
> is intended to endorse an integration pattern that is starting to be used
> in the Python ecosystem, where some libraries are providing their own
> scanners with this API, while query engines are accepting these as
> duck-typed objects.
>
> To give some background: back at the end of 2021, we collaborated with
> DuckDB to be able to read datasets (an Arrow C++ concept), supporting
> column selection and filter pushdown. This was accomplished by having
> DuckDB manipulating Python (or R) objects to get a RecordBatchReader and
> then exporting over the C Stream Interface.
>
> Since then, DataFusion [2] and Polars have both made similar
> implementations for their Python bindings, allowing them to consume PyArrow
> datasets. This has created an implicit protocol, whereby arbitrary compute
> engines can push down queries into the PyArrow dataset scanner.
>
> Now, libraries supporting table formats including Delta Lake, Lance, and
> Iceberg are looking to be able to support these engines, while bringing
> their own scanners and metadata handling implementations. One possible
> route is allowing them to imitate the PyArrow datasets API.
>
> Bringing these use cases together, I'd like to propose an experimental
> protocol, made out of the minimal subset of the PyArrow Dataset API
> necessary to facilitate this kind of integration. This would allow any
> library to produce a scanner implementation and that arbitrary query
> engines could call into. I've drafted a PR [3] and there is some background
> research available in a google doc [4].
>
> I've already gotten some good feedback on both, and would welcome more.
>
> One last point: I'd like for this to be a first step rather than a
> comprehensive API. This PR focuses on making explicit a protocol that is
> already in use in the ecosystem, but without much concrete definition. Once
> this is established, we can use our experience from this protocol to design
> something more permanent that takes advantage of newer innovations in the
> Arrow ecosystem (such as the PyCapsule for C Data Interface or
> Substrait for passing expressions / scan plans). I am tracking such future
> improvements in [5].
>
> Best,
>
> Will Jones
>
> [1] https://duckdb.org/2021/12/03/duck-arrow.html
> [2] https://github.com/apache/arrow-datafusion-python/pull/9
> [3] https://github.com/apache/arrow/pull/35568
> [4]
> https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> [5]
> https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit