kylebrooks-8451 commented on issue #11: URL: https://github.com/apache/arrow-datafusion-python/issues/11#issuecomment-1195726755
@andygrove Thanks for the feedback! No, it would be supported by fsspec, which is already supported by PyArrow.

With the PyArrow Dataset TableProvider PR merged, users can already use fsspec and Datasets in DataFusion, but the scanning is done in PyArrow. This would add the ability to use fsspec for the filesystem / ObjectStore and the native DataFusion TableProviders (e.g., Parquet, ParquetExec) for the scanning.

There are pros and cons to each approach. PyArrow Datasets eagerly list and store the partitioning and the paths to the data when they are constructed. This means that if the paths to the backing data change at runtime, the Dataset becomes invalid. Native DataFusion lists the paths / partitions for every query, so it handles changes to the backing data. However, for tables with many partitions, DataFusion queries take a long time because every query has to list all the partitions from the ObjectStore.

For our internal project, we use Datasets for large tables with many partitions and the fsspec ObjectStore for smaller tables.
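To make the trade-off concrete, here is a rough, stdlib-only sketch of the two listing strategies. It does not use the DataFusion or PyArrow APIs: `EagerTable`, `LazyTable`, and `demo` are hypothetical names invented for illustration, and the `.parquet` files are plain-text placeholders rather than real Parquet data.

```python
import tempfile
from pathlib import Path


class EagerTable:
    """Hypothetical stand-in for eager listing: paths are captured once."""

    def __init__(self, root: Path):
        # List every data file at construction time and remember the result.
        self.files = sorted(root.rglob("*.parquet"))

    def scan(self):
        # Reads the remembered paths; raises if one has since been removed.
        return [f.read_text() for f in self.files]


class LazyTable:
    """Hypothetical stand-in for per-query listing: paths are re-listed on each scan."""

    def __init__(self, root: Path):
        self.root = root
        self.listings = 0  # counts how often the store was listed

    def scan(self):
        # Every scan pays for a fresh listing, but always sees current files.
        self.listings += 1
        return [f.read_text() for f in sorted(self.root.rglob("*.parquet"))]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        part = root / "year=2022"
        part.mkdir()
        (part / "a.parquet").write_text("a")  # placeholder, not real Parquet

        eager = EagerTable(root)
        lazy = LazyTable(root)

        # The backing data changes at runtime: a.parquet is replaced by b.parquet.
        (part / "a.parquet").unlink()
        (part / "b.parquet").write_text("b")

        try:
            eager_result = eager.scan()
        except FileNotFoundError:
            eager_result = "stale"  # the remembered path no longer exists

        lazy_result = lazy.scan()
        lazy.scan()  # a second query triggers a second listing
        return eager_result, lazy_result, lazy.listings
```

In this sketch the eager table breaks once the file it remembered is replaced, while the lazy table picks up the new file but lists the directory again on every scan. That repeated listing is the cost that grows with the number of partitions.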
