rdettai commented on pull request #811:
URL: https://github.com/apache/arrow-datafusion/pull/811#issuecomment-896727736


   When I talk about a catalog, I mean:
   - schema
   - list of files with statistics. 
   
   Ideally, you should be able to compose different ways of getting the list of 
files with different ways of reading them. For example, when reading from S3, 
you might get the list of files from `s3.list_objects`, but also from a Hive 
catalog or from Delta.
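   To make the composition idea concrete, here is a minimal Rust sketch. The 
names (`FileListing`, `ObjectStoreListing`, `CatalogListing`) are illustrative, 
not the actual DataFusion API; `PartitionedFile` here is a placeholder for the 
struct mentioned in this PR.

   ```rust
   /// Placeholder for the file descriptor a reader would consume.
   #[derive(Debug, Clone)]
   pub struct PartitionedFile {
       pub path: String,
   }

   /// Hypothetical trait: "how do I get the list of files" is decoupled
   /// from "how do I read them".
   pub trait FileListing {
       fn list_files(&self) -> Vec<PartitionedFile>;
   }

   /// One source: an object-store style lister (e.g. backed by s3.list_objects).
   pub struct ObjectStoreListing {
       pub keys: Vec<String>,
   }

   impl FileListing for ObjectStoreListing {
       fn list_files(&self) -> Vec<PartitionedFile> {
           self.keys
               .iter()
               .map(|k| PartitionedFile { path: k.clone() })
               .collect()
       }
   }

   /// Another source: a catalog (Hive, Delta, ...) that already knows the files.
   pub struct CatalogListing {
       pub entries: Vec<String>,
   }

   impl FileListing for CatalogListing {
       fn list_files(&self) -> Vec<PartitionedFile> {
           self.entries
               .iter()
               .map(|e| PartitionedFile { path: e.clone() })
               .collect()
       }
   }

   /// A reader only depends on the trait, so any listing source plugs in.
   pub fn scan(listing: &dyn FileListing) -> usize {
       // A real reader would open each file; here we just count them.
       listing.list_files().len()
   }

   fn main() {
       let s3 = ObjectStoreListing {
           keys: vec!["s3://bucket/a.parquet".into()],
       };
       let hive = CatalogListing {
           entries: vec!["f1.parquet".into(), "f2.parquet".into()],
       };
       assert_eq!(scan(&s3), 1);
       assert_eq!(scan(&hive), 2);
   }
   ```

   The same `scan` works against either source, which is the point: the reading 
side never needs to know whether the list came from an object store or a catalog.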
   
   Regarding early materialization of the file list: the use case I have in 
mind is a bucket with partitioned data, where most queries only touch a 
fraction of the files. For example, if you generate 24 files per day, then even 
with 3 years of Parquet in your bucket, a query that targets only 3 days of 
data should work fine (once partitions are detected properly). But if you need 
to open all the files when registering the table, you won't scale to buckets 
with large numbers of files (in this example you would need to open roughly 
26k files first). I understand that partition pruning is not implemented yet, 
but since you created a structure called `PartitionedFile`, I guess that this 
would have been the next step, no? 😉 
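   A rough sketch of what that pruning could look like on hive-style paths 
(`date=YYYY-MM-DD/hour=HH/...`): filter the listed paths on the partition 
columns before opening any file. The helper names and path layout below are 
assumptions for illustration, not the DataFusion implementation.

   ```rust
   /// Extract the value of a hive-style partition key from a path,
   /// e.g. partition_value("date=2021-08-01/hour=00/f.parquet", "date")
   /// yields Some("2021-08-01").
   fn partition_value<'a>(path: &'a str, key: &str) -> Option<&'a str> {
       let prefix = format!("{}=", key);
       path.split('/')
           .find(|seg| seg.starts_with(&prefix))
           .map(|seg| &seg[prefix.len()..])
   }

   /// Keep only the paths whose `date` partition is in the requested set.
   /// Only these survivors would ever be opened by the reader.
   fn prune<'a>(paths: &'a [String], wanted_dates: &[&str]) -> Vec<&'a String> {
       paths
           .iter()
           .filter(|p| {
               partition_value(p.as_str(), "date")
                   .map_or(false, |d| wanted_dates.contains(&d))
           })
           .collect()
   }

   fn main() {
       let paths = vec![
           "date=2021-08-01/hour=00/part-0.parquet".to_string(),
           "date=2021-08-01/hour=01/part-0.parquet".to_string(),
           "date=2021-08-02/hour=00/part-0.parquet".to_string(),
       ];
       // A 3-day query keeps a handful of files out of the (potentially 26k)
       // listed paths, without opening any of the pruned ones.
       let kept = prune(&paths, &["2021-08-02"]);
       assert_eq!(kept.len(), 1);
   }
   ```

   The key property is that pruning happens purely on the listed paths, so the 
cost of a query scales with the files it actually reads, not with the size of 
the bucket.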


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
