Moving the conversation to dev@, which is a more appropriate place to discuss this.

On Tuesday, November 1, 2022, Chang She <ch...@eto.ai> wrote:
> Hi there,
>
> The pyarrow dataset API is marked experimental so I'm curious if y'all
> have made any decisions on it for upcoming releases. Specifically, any
> thoughts on making the Scanner and things like FileSystemDataset part of
> the "public API" (i.e., putting declarations in the _dataset.pxd)? It would
> make it a lot easier for new data formats to be built on top of the Arrow
> platform. e.g., Lance supports efficient partial reads from s3 for
> limit/offset (via additional ScanOptions), but currently it's difficult to
> expose the scanner to the rest of Arrow. Instead we subclass Dataset and
> return a custom scanner we created. And our Dataset subclass *should* be a
> FileSystemDataset subclass, but FileSystemDataset is not "public API" etc.
> Happy to discuss additional details, for reference:
> github.com/eto-ai/lance
>
> Thanks!
>
> Chang