FileSystemDataset is part of the public API (and in a pxd file[1]).  I
would agree it's fair to say that pyarrow datasets are no longer
experimental.
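
For example, FileSystemDataset is importable straight from the public
pyarrow.dataset namespace.  A minimal sketch (assuming a local
directory of parquet files):

    import pyarrow.dataset as ds

    # Discovery on a directory of files yields a FileSystemDataset
    dataset = ds.dataset("data/", format="parquet")
    assert isinstance(dataset, ds.FileSystemDataset)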

> Instead we subclass Dataset and return a custom scanner we created. And our 
> Dataset subclass *should* be a FileSystemDataset subclass, but 
> FileSystemDataset is not "public API" etc.

Hmm, perhaps the problem is that Dataset & Scanner are not considered
to be extension points, in the way that something like FileFormat or
Fragment is.  I would argue that even something like FileSystemDataset
probably doesn't need to be a child class of Dataset; it is more a
combination of:

 * Dataset discovery (this is already a standalone utility)
 * FileFragment
 * Dataset write (this is already a static method)

A "dataset" then is more of a container (e.g. a collection of
fragments and a schema) and not something that actually has
functionality.
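
To make this concrete, those pieces already surface that way in the
Python API (a sketch, assuming a local directory of parquet files; on
the C++ side the write entry point is the static
FileSystemDataset::Write):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Discovery is a standalone factory, not a Dataset method
    dataset = ds.dataset("data/", format="parquet")

    # The resulting "dataset" is essentially a schema plus fragments
    schema = dataset.schema
    fragments = list(dataset.get_fragments())

    # Writing is likewise a module-level function on the Python side
    ds.write_dataset(pa.table({"x": [1, 2, 3]}), "out/", format="parquet")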

Setting this aside for a moment, I have been slowly working on new
scanning functionality[2].  I've been hoping to support offset/limit
natively within the new scanner work, but haven't had time to get to
it yet. Some other goals:

 * Better support for cancellation and error handling (currently, on
error, we continue to read files even after the scan has completed,
which can lead to further errors or crashes)
 * Formally defining schema evolution, with the hope of allowing
others to add more sophisticated approaches here (e.g. Parquet field
IDs for integration with Iceberg)
 * More control over scheduling and, eventually, a better ordering of
scan tasks to facilitate in-order traversal
 * Native limit / offset which could run in parallel and still prevent
over-read (except for some potential over-read of metadata when
running in parallel) for formats whose metadata exposes row counts
(e.g. Parquet, but not CSV or JSON); see the sketch after this list
 * Simpler scan options (e.g. projection is very confusing in today's
scan options; the ability to specify readahead limits in bytes; etc.)
 * Simpler implementation (switch from merge generator to async task scheduler)
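
For context on the limit / offset point: today, in the Python API, a
limit is applied after the fact and an offset means scanning and then
discarding rows.  A sketch (assuming a local parquet dataset):

    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # head() stops delivering batches once 100 rows are collected, but
    # readahead may already have issued reads well past the limit
    first_100 = dataset.head(100)

    # offset today: materialize the scan, then slice off the page
    page = dataset.to_table().slice(offset=1000, length=100)

A native limit inside the scanner could stop issuing reads once the
limit is satisfied, instead of trimming rows after they are produced.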

Unfortunately, it may be a breaking change.  I think I can adapt the
existing fragment API onto the new fragment API[3] (which is hopefully
simpler).  But your changes to scanner/dataset might not map as
cleanly.  When I get close to the point of switching (right now I'm
hoping to get this wrapped up before or during the winter holidays),
I'd like to work with you to ensure we can get Lance working with the
new scanner as well.

[1] https://github.com/apache/arrow/blob/5e53978b56aa13f9c033f2e849cc22f2aed6e2d3/python/pyarrow/includes/libarrow_dataset.pxd#L236
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scan_node.cc
[3] https://github.com/apache/arrow/blob/5e53978b56aa13f9c033f2e849cc22f2aed6e2d3/cpp/src/arrow/dataset/dataset.h#L162

On Tue, Nov 1, 2022 at 7:10 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Moving conversation to dev@ which is more appropriate place to discuss.
>
> On Tuesday, November 1, 2022, Chang She <ch...@eto.ai> wrote:
>
> > Hi there,
> >
> > The pyarrow dataset API is marked experimental so I'm curious if y'all
> > have made any decisions on it for upcoming releases. Specifically, any
> > thoughts on making the Scanner and things like FileSystemDataset part of
> > the "public API" (i.e., putting declarations in the _dataset.pxd)? It would
> > make it a lot easier for new data formats to be built on top of the Arrow
> > platform. e.g., Lance supports efficient partial reads from s3 for
> > limit/offset (via additional ScanOptions), but currently it's difficult to
> > expose the scanner to the rest of Arrow. Instead we subclass Dataset and
> > return a custom scanner we created. And our Dataset subclass *should* be a
> > FileSystemDataset subclass, but FileSystemDataset is not "public API" etc.
> > Happy to discuss additional details, for reference:
> > github.com/eto-ai/lance
> >
> > Thanks!
> >
> > Chang
> >
