houqp edited a comment on pull request #950:
URL: https://github.com/apache/arrow-datafusion/pull/950#issuecomment-907984810
> I am still wondering how we could replicate the hive partitioning support for TableProviders relative to different file formats (ParquetTable, CsvFile, NdJsonFile).

For raw file tables like a folder of parquet, csv and ndjson files, I think the partition discovery and pruning logic should be agnostic to both file format and object store. We could implement this logic as a shared module that gets invoked by the ParquetTable, CsvFile and NdJsonFile table providers to resolve the filtered-down object list. These table providers would then pass the object paths down to the corresponding format-specific physical execution plans.

> Actually, same would apply for a HiveCatalogTable that could actually reference various file formats. It makes me feel that something is wrong with how the different abstractions are organized.

This is being discussed in #133. Given that a Hive catalog manages a collection of databases and tables with their corresponding metadata, I think CatalogProvider is probably the right abstraction for it. The Hive catalog provider could query the Hive metastore for a given schema and table name, and then construct a format-specific table provider based on the metastore response.
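To make the "format- and store-agnostic shared module" idea concrete, here is a minimal sketch of hive-style partition discovery and pruning that operates purely on object path strings. All names (`parse_partitions`, `prune_objects`) are hypothetical and illustrative, not DataFusion's actual API; it only handles equality predicates:

```rust
use std::collections::HashMap;

/// Parse hive-style partition key/value segments (e.g. "year=2021/month=08")
/// out of an object path. Segments without '=' (like file names) are skipped.
fn parse_partitions(path: &str) -> HashMap<String, String> {
    path.split('/')
        .filter_map(|seg| {
            let mut it = seg.splitn(2, '=');
            match (it.next(), it.next()) {
                (Some(k), Some(v)) if !k.is_empty() => {
                    Some((k.to_string(), v.to_string()))
                }
                _ => None,
            }
        })
        .collect()
}

/// Prune an object listing down to the paths whose partition values satisfy
/// the given equality predicates. Because it only inspects path strings, the
/// same logic works for parquet, csv and ndjson tables on any object store.
fn prune_objects<'a>(
    objects: &'a [&'a str],
    predicates: &HashMap<String, String>,
) -> Vec<&'a str> {
    objects
        .iter()
        .filter(|path| {
            let parts = parse_partitions(path);
            predicates
                .iter()
                .all(|(k, v)| parts.get(k).map(String::as_str) == Some(v.as_str()))
        })
        .copied()
        .collect()
}

fn main() {
    let objects = [
        "table/year=2021/month=08/part-0.parquet",
        "table/year=2021/month=09/part-0.parquet",
        "table/year=2020/month=08/part-0.parquet",
    ];
    let mut preds = HashMap::new();
    preds.insert("year".to_string(), "2021".to_string());
    let kept = prune_objects(&objects, &preds);
    println!("{:?}", kept); // only the year=2021 paths survive
    assert_eq!(kept.len(), 2);
}
```

A table provider would call something like `prune_objects` against the store listing first, and only hand the surviving paths to the format-specific physical plan.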

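The CatalogProvider idea above (metastore lookup first, then a format-specific table provider) could be sketched as below. The types here are hypothetical stand-ins, not the real `CatalogProvider`/`TableProvider` traits from the datafusion crate, and a map replaces the metastore client so the sketch stays self-contained:

```rust
use std::collections::HashMap;

/// File formats a Hive table's metadata might declare.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileFormat {
    Parquet,
    Csv,
    NdJson,
}

/// A slice of what a Hive metastore response might carry for one table.
struct TableMeta {
    location: String,
    format: FileFormat,
}

/// Sketch of a Hive-backed catalog provider: it resolves (schema, table)
/// names to metastore metadata, then picks a format-specific provider.
struct HiveCatalogProvider {
    // In reality this would be a metastore client; a map keeps it simple.
    metastore: HashMap<(String, String), TableMeta>,
}

impl HiveCatalogProvider {
    /// Returns None when the table is not in the metastore; otherwise a
    /// description of the provider a real implementation would construct.
    fn table_provider(&self, schema: &str, table: &str) -> Option<String> {
        let meta = self
            .metastore
            .get(&(schema.to_string(), table.to_string()))?;
        // A real implementation would build ParquetTable / CsvFile /
        // NdJsonFile here; this sketch just reports the dispatch choice.
        Some(match meta.format {
            FileFormat::Parquet => format!("ParquetTable({})", meta.location),
            FileFormat::Csv => format!("CsvFile({})", meta.location),
            FileFormat::NdJson => format!("NdJsonFile({})", meta.location),
        })
    }
}

fn main() {
    let mut metastore = HashMap::new();
    metastore.insert(
        ("default".to_string(), "events".to_string()),
        TableMeta {
            location: "s3://bucket/events".to_string(),
            format: FileFormat::Parquet,
        },
    );
    let catalog = HiveCatalogProvider { metastore };
    println!("{:?}", catalog.table_provider("default", "events"));
}
```

The point of the dispatch is that the catalog layer owns the metadata lookup while reusing the existing per-format table providers unchanged.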