alamb commented on pull request #1010: URL: https://github.com/apache/arrow-datafusion/pull/1010#issuecomment-920922893
TLDR: I wonder: if DataFusion planning were `async`, would you be able to implement the table format as you would like?

I really like the idea of splitting the details of reading formats from the "metadata management" (called "table format" in the linked doc) so that users of DataFusion can extend DataFusion to manage metadata in ways suited to their use.

I thought, however, we were headed towards a slightly different abstraction where we would still have a `ParquetReader` that didn't use `Path` / `File` directly, but instead would use the [`ObjectStore`](https://github.com/apache/arrow-datafusion/blob/6f531807176e49110c33a01722014552024fa412/datafusion/src/datasource/object_store/mod.rs#L77) abstraction recently added by @yjshen.

In terms of the document, https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ/edit?usp=sharing, my biggest takeaway was that:

1) Any `TableProvider` needs to provide the schema (columns, types) with almost no information from the query, in order to create a `LogicalPlan`
2) The details of how `statistics()` and `scan()` work will be different based on:
   - The actual file format (parquet, json, etc)
   - The cost of accessing the statistics and creating the requested `ExecutionPlan`s (e.g. a bunch of remote files on S3 vs cached in-memory copies)

At the moment, the user has to synchronously create a `TableProvider` for each named table in the query and (synchronously) provide the schema, as well as synchronously provide statistics. The creation of `ExecutionPlan`s via `execute` is already `async`.

Thus, I am wondering: if fetching the `TableProvider` and creating the statistics were `async`, would that be sufficient?
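
To make the question above concrete, here is a rough sketch of what async planning hooks might look like. None of these names are DataFusion's API: `AsyncTableProvider`, `AsyncCatalog`, `InMemoryTable`, `SingleTableCatalog`, `plan_scan`, and the simplified `Schema` / `Statistics` structs are all hypothetical, and the tiny `block_on` only stands in for a real runtime such as tokio so the sketch needs nothing beyond the standard library.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Boxed future alias so the traits stay object safe without extra crates.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + 'a>>;

/// Hypothetical, simplified stand-ins for the real schema/statistics types.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub columns: Vec<(String, String)>, // (column name, type name)
}

#[derive(Debug, Clone, PartialEq)]
pub struct Statistics {
    pub num_rows: Option<usize>,
}

/// Hypothetical async provider: schema and statistics may require I/O
/// (e.g. listing remote files), so both return futures.
pub trait AsyncTableProvider {
    fn schema(&self) -> BoxFuture<'_, Schema>;
    fn statistics(&self) -> BoxFuture<'_, Statistics>;
}

/// Hypothetical async lookup of a provider by table name.
pub trait AsyncCatalog {
    fn table<'a>(&'a self, name: &'a str) -> BoxFuture<'a, Option<Arc<dyn AsyncTableProvider>>>;
}

/// Toy provider whose metadata is already cached in memory.
pub struct InMemoryTable {
    pub schema: Schema,
    pub num_rows: usize,
}

impl AsyncTableProvider for InMemoryTable {
    fn schema(&self) -> BoxFuture<'_, Schema> {
        Box::pin(async move { self.schema.clone() })
    }
    fn statistics(&self) -> BoxFuture<'_, Statistics> {
        Box::pin(async move { Statistics { num_rows: Some(self.num_rows) } })
    }
}

/// Toy catalog that knows exactly one table.
pub struct SingleTableCatalog {
    pub name: String,
    pub table: Arc<InMemoryTable>,
}

impl AsyncCatalog for SingleTableCatalog {
    fn table<'a>(&'a self, name: &'a str) -> BoxFuture<'a, Option<Arc<dyn AsyncTableProvider>>> {
        Box::pin(async move {
            if name == self.name.as_str() {
                Some(self.table.clone() as Arc<dyn AsyncTableProvider>)
            } else {
                None
            }
        })
    }
}

/// Planning step: resolve the table, then await its schema and statistics.
pub async fn plan_scan(catalog: &dyn AsyncCatalog, name: &str) -> Option<(Schema, Statistics)> {
    let table = catalog.table(name).await?;
    let schema = table.schema().await;
    let stats = table.statistics().await;
    Some((schema, stats))
}

/// Waker that does nothing; enough for futures that never actually park.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, for illustration only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

With something along these lines, a provider backed by remote storage could fetch file listings or footers inside `schema()` / `statistics()` without blocking the planner thread, while an in-memory provider like the toy one here would simply return immediately.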