houqp edited a comment on pull request #950:
URL: https://github.com/apache/arrow-datafusion/pull/950#issuecomment-907984810
> I am still wondering how we could replicate the hive partitioning support for TableProviders relative to different file formats (ParquetTable, CsvFile, NdJsonFile).

For raw file tables like a folder of parquet, csv and ndjson files, I think the partition discovery and pruning logic should be agnostic to both file format and object store. We could implement this logic as a shared module that gets invoked by the ParquetTable, CsvFile and NdJsonFile table providers to resolve the filtered-down object list. These table providers would then pass the object paths down to the corresponding format-specific physical execution plans.

> Actually, same would apply for a HiveCatalogTable that could actually reference various file formats. It makes me feel that something is wrong with how the different abstractions are organized.

This is being discussed in #133. Given that a Hive catalog manages a collection of databases and tables with their corresponding metadata, I think CatalogProvider is probably the right abstraction for it. The Hive catalog provider could query the Hive metastore for a given schema and table name, and then construct a format-specific table provider based on the metastore response.
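To make the "format- and store-agnostic shared module" idea concrete, here is a minimal sketch of hive-style partition discovery and pruning that operates purely on object path strings. All names (`parse_partitions`, `prune_objects`) are hypothetical and illustrative, not DataFusion's actual API; it only handles equality predicates:

```rust
use std::collections::HashMap;

/// Parse hive-style partition key/value segments (e.g. "year=2021/month=08")
/// out of an object path. Segments without '=' (like file names) are skipped.
fn parse_partitions(path: &str) -> HashMap<String, String> {
    path.split('/')
        .filter_map(|seg| {
            let mut it = seg.splitn(2, '=');
            match (it.next(), it.next()) {
                (Some(k), Some(v)) if !k.is_empty() => {
                    Some((k.to_string(), v.to_string()))
                }
                _ => None,
            }
        })
        .collect()
}

/// Prune an object listing down to the paths whose partition values satisfy
/// the given equality predicates. Because it only inspects path strings, the
/// same logic works for parquet, csv and ndjson tables on any object store.
fn prune_objects<'a>(
    objects: &'a [&'a str],
    predicates: &HashMap<String, String>,
) -> Vec<&'a str> {
    objects
        .iter()
        .filter(|path| {
            let parts = parse_partitions(path);
            predicates
                .iter()
                .all(|(k, v)| parts.get(k).map(String::as_str) == Some(v.as_str()))
        })
        .copied()
        .collect()
}

fn main() {
    let objects = [
        "table/year=2021/month=08/part-0.parquet",
        "table/year=2021/month=09/part-0.parquet",
        "table/year=2020/month=08/part-0.parquet",
    ];
    let mut preds = HashMap::new();
    preds.insert("year".to_string(), "2021".to_string());
    let kept = prune_objects(&objects, &preds);
    println!("{:?}", kept); // only the year=2021 paths survive
    assert_eq!(kept.len(), 2);
}
```

A table provider would call something like `prune_objects` against the store listing first, and only hand the surviving paths to the format-specific physical plan.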

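The CatalogProvider idea above (metastore lookup first, then a format-specific table provider) could be sketched as below. The types here are hypothetical stand-ins, not the real `CatalogProvider`/`TableProvider` traits from the datafusion crate, and a map replaces the metastore client so the sketch stays self-contained:

```rust
use std::collections::HashMap;

/// File formats a Hive table's metadata might declare.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileFormat {
    Parquet,
    Csv,
    NdJson,
}

/// A slice of what a Hive metastore response might carry for one table.
struct TableMeta {
    location: String,
    format: FileFormat,
}

/// Sketch of a Hive-backed catalog provider: it resolves (schema, table)
/// names to metastore metadata, then picks a format-specific provider.
struct HiveCatalogProvider {
    // In reality this would be a metastore client; a map keeps it simple.
    metastore: HashMap<(String, String), TableMeta>,
}

impl HiveCatalogProvider {
    /// Returns None when the table is not in the metastore; otherwise a
    /// description of the provider a real implementation would construct.
    fn table_provider(&self, schema: &str, table: &str) -> Option<String> {
        let meta = self
            .metastore
            .get(&(schema.to_string(), table.to_string()))?;
        // A real implementation would build ParquetTable / CsvFile /
        // NdJsonFile here; this sketch just reports the dispatch choice.
        Some(match meta.format {
            FileFormat::Parquet => format!("ParquetTable({})", meta.location),
            FileFormat::Csv => format!("CsvFile({})", meta.location),
            FileFormat::NdJson => format!("NdJsonFile({})", meta.location),
        })
    }
}

fn main() {
    let mut metastore = HashMap::new();
    metastore.insert(
        ("default".to_string(), "events".to_string()),
        TableMeta {
            location: "s3://bucket/events".to_string(),
            format: FileFormat::Parquet,
        },
    );
    let catalog = HiveCatalogProvider { metastore };
    println!("{:?}", catalog.table_provider("default", "events"));
}
```

The point of the dispatch is that the catalog layer owns the metadata lookup while reusing the existing per-format table providers unchanged.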