[GitHub] [arrow-datafusion-python] kylebrooks-8451 commented on issue #11: Support fsspec based filesystems

GitBox Tue, 26 Jul 2022 09:46:20 -0700


kylebrooks-8451 commented on issue #11:
URL: https://github.com/apache/arrow-datafusion-python/issues/11#issuecomment-1195726755


   @andygrove Thanks for the feedback!  No, it would be supported through 
fsspec, which PyArrow already supports.  With the PyArrow Dataset TableProvider 
PR merged, users can already use fsspec and Datasets in DataFusion, but the 
scanning is done in PyArrow.
   
   This would add the ability to use fsspec for the filesystem / ObjectStore 
while using the native DataFusion TableProviders (e.g., Parquet, ParquetExec) 
for the scanning.
   
   There are pros / cons to each approach.  PyArrow Datasets eagerly list and 
store the partitioning and paths to the data when they are constructed.  This 
means that if the paths to backing data change at runtime, the Dataset becomes 
invalid.
   
   Native DataFusion lists the paths / partitions for every query, so it 
handles changes to backing data.  However, for tables with many partitions, 
DataFusion queries spend a long time listing all the partitions from the 
ObjectStore.
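
   As a rough illustration of the trade-off described above, here is a 
standard-library-only sketch.  It does not use PyArrow or DataFusion: 
`EagerTable`, `LazyTable`, `list_files`, `write_partition` and `demo` are 
hypothetical names made up for this example, and plain text files stand in for 
Parquet partitions.  The eager table snapshots its paths at construction, 
while the lazy table re-lists the directory on every scan.

```python
import os
import tempfile


def list_files(root):
    """Walk the directory tree and return every file path, sorted."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


def read_text(path):
    with open(path) as fh:
        return fh.read()


def write_partition(root, partition, content):
    """Create <root>/<partition>/data.txt holding `content`."""
    part_dir = os.path.join(root, partition)
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data.txt"), "w") as fh:
        fh.write(content)


class EagerTable:
    """Lists paths once, at construction (stand-in for the eager approach)."""

    def __init__(self, root):
        self.paths = list_files(root)  # snapshot taken here, never refreshed

    def scan(self):
        # Cheap per query, but blind to partitions added after construction.
        return [read_text(p) for p in self.paths]


class LazyTable:
    """Re-lists paths on every scan (stand-in for per-query listing)."""

    def __init__(self, root):
        self.root = root

    def scan(self):
        # Pays the listing cost on every query, but always sees current data.
        return [read_text(p) for p in list_files(self.root)]


def demo():
    with tempfile.TemporaryDirectory() as root:
        write_partition(root, "year=2021", "a")
        eager = EagerTable(root)
        lazy = LazyTable(root)
        # The backing data changes after both tables were created.
        write_partition(root, "year=2022", "b")
        return eager.scan(), lazy.scan()


if __name__ == "__main__":
    eager_rows, lazy_rows = demo()
    print("eager:", eager_rows)
    print("lazy: ", lazy_rows)
```

   In this toy version the eager table misses the partition added after 
construction, and the lazy table picks it up at the cost of walking the 
directory on every scan.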
   
   For our internal project, we use Datasets for large tables with many 
partitions and the fsspec ObjectStore for smaller tables.
   
   


