Thanks, Will, for this proposal! To anyone familiar with PyArrow, this idea has a clear intuitive logic: it offers an expedient solution to the current lack of a practical means of interchanging "unmaterialized dataframes" between Python libraries.
To elaborate on that: if you look at how people use the Arrow Dataset API (which is implemented in the Arrow C++ library [1] and has bindings not only for Python [2] but also for Java [3] and R [4]), you'll see that Dataset is often used simply as a "virtual" variant of Table. It is used when the data is larger than memory, or when it is desirable to defer reading (materializing) the data into memory. So we can think of a Table as a materialized dataframe and a Dataset as an unmaterialized dataframe. That aspect of Dataset is, I think, what makes it most attractive as a protocol for enabling interoperability: it allows libraries to easily "speak Arrow" in cases where materializing the full data in memory up front is impossible or undesirable.

The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant to many of the applications Will anticipates. Some work is also needed to reconcile the methods of the PyArrow Dataset object [6] with those of the Table object. Some methods, like filter(), are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset, though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset and encouraging new scanner implementations to expose their own options controlling which lazy operations are performed.

Will, I see that you've already addressed this issue to some extent in your proposal. For example, you mention that we should initially define this protocol to include only a minimal subset of the Dataset API. I agree, but I think there are some loose ends we should be careful to tie up.
I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not part of the Arrow standard. It would be much safer for the initial version of this protocol not to define *any* methods or arguments that take expressions. This will give us time to finish the Substrait expression implementation work that is underway [7][8] and then introduce Substrait-based expressions in a later version of the protocol. This approach will also better position the protocol to be implemented in languages other than Python.

Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient. RecordBatchReader is another possible way to represent "unmaterialized dataframes," and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset. We should help developers and users understand why Arrow needs both of these.

Thanks, Will, for your thoughtful prose explanations of this proposed API. After we arrive at a decision, I think we should reproduce some of these explanations in docs, blog posts, cookbook recipes, etc., because there is nuance here that integrators of this API will need to understand.
Ian

[1] https://arrow.apache.org/docs/cpp/api/dataset.html
[2] https://arrow.apache.org/docs/python/dataset.html
[3] https://arrow.apache.org/docs/java/dataset.html
[4] https://arrow.apache.org/docs/r/articles/dataset.html
[5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
[6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
[7] https://github.com/apache/arrow/issues/33985
[8] https://github.com/apache/arrow/issues/34252
[9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html

On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> wrote:
>
> Hello Arrow devs,
>
> I have drafted a PR defining an experimental protocol which would allow
> third-party libraries to imitate the PyArrow Dataset API [5]. This protocol
> is intended to endorse an integration pattern that is starting to be used
> in the Python ecosystem, where some libraries are providing their own
> scanners with this API, while query engines are accepting these as
> duck-typed objects.
>
> To give some background: back at the end of 2021, we collaborated with
> DuckDB to be able to read datasets (an Arrow C++ concept), supporting
> column selection and filter pushdown. This was accomplished by having
> DuckDB manipulating Python (or R) objects to get a RecordBatchReader and
> then exporting over the C Stream Interface.
>
> Since then, DataFusion [2] and Polars have both made similar
> implementations for their Python bindings, allowing them to consume PyArrow
> datasets. This has created an implicit protocol, whereby arbitrary compute
> engines can push down queries into the PyArrow dataset scanner.
>
> Now, libraries supporting table formats including Delta Lake, Lance, and
> Iceberg are looking to be able to support these engines, while bringing
> their own scanners and metadata handling implementations. One possible
> route is allowing them to imitate the PyArrow datasets API.
>
> Bringing these use cases together, I'd like to propose an experimental
> protocol, made out of the minimal subset of the PyArrow Dataset API
> necessary to facilitate this kind of integration. This would allow any
> library to produce a scanner implementation and that arbitrary query
> engines could call into. I've drafted a PR [3] and there is some background
> research available in a google doc [4].
>
> I've already gotten some good feedback on both, and would welcome more.
>
> One last point: I'd like for this to be a first step rather than a
> comprehensive API. This PR focuses on making explicit a protocol that is
> already in use in the ecosystem, but without much concrete definition. Once
> this is established, we can use our experience from this protocol to design
> something more permanent that takes advantage of newer innovations in the
> Arrow ecosystem (such as the PyCapsule for C Data Interface or
> Substrait for passing expressions / scan plans). I am tracking such future
> improvements in [5].
>
> Best,
>
> Will Jones
>
> [1] https://duckdb.org/2021/12/03/duck-arrow.html
> [2] https://github.com/apache/arrow-datafusion-python/pull/9
> [3] https://github.com/apache/arrow/pull/35568
> [4]
> https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> [5]
> https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit