[GitHub] [arrow-datafusion] alamb commented on pull request #1010: Reorganize table providers by table format

GitBox Thu, 16 Sep 2021 06:53:17 -0700


alamb commented on pull request #1010:
URL: https://github.com/apache/arrow-datafusion/pull/1010#issuecomment-920922893



   TLDR: I wonder: "if DataFusion planning were `async`, would you be able to 
implement the table format as you would like?"
   
   I really like the idea of splitting the details of reading formats from the 
"metadata management" (called "table format" in the linked doc) so that users 
of DataFusion can extend DataFusion to manage metadata in ways suited to their 
use.
   
   I thought, however, we were headed towards a slightly different abstraction 
where we would still have a `ParquetReader` that didn't use `Path` / `File` 
directly, but instead would use the 
[`ObjectStore`](https://github.com/apache/arrow-datafusion/blob/6f531807176e49110c33a01722014552024fa412/datafusion/src/datasource/object_store/mod.rs#L77)
  abstraction recently added by @yjshen.
   
   In terms of the document, 
https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ/edit?usp=sharing,
 my biggest takeaway was that:
   
   1) Any `TableProvider` needs to provide the schema (columns, types), using 
almost no information from the query, so that a `LogicalPlan` can be created
   
   2) The details of how `statistics()` and `scan()` work will be different 
based on:
   - The actual file format (parquet, json, etc)
   - The cost of accessing the statistics and creating the requested 
`ExecutionPlan`s (e.g. a bunch of remote files on S3 vs. cached in-memory copies)
   
   At the moment, the user has to synchronously create a `TableProvider` for 
each named table in the query and (synchronously) provide the Schema, as well 
as synchronously provide statistics. The creation of `ExecutionPlan`s via 
`execute` is already `async`.
   
   Thus, I am wondering: if fetching the `TableProvider` and creating its 
statistics were `async`, would that be sufficient?
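
   To make the question concrete, here is a minimal sketch of what an `async` 
provider lookup and `async` statistics could look like. Every name in it 
(`AsyncCatalog`, `AsyncTableProvider`, `RemoteTable`, `plan_table`, `block_on`) 
is hypothetical and exists only for illustration; none of it is DataFusion's 
actual API, and `Schema` / `Statistics` are simplified stand-ins. It uses only 
the standard library, so the tiny `block_on` helper stands in for a real 
runtime such as tokio.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Simplified stand-ins for illustration; NOT the DataFusion types.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub columns: Vec<(String, String)>, // (column name, type name)
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct Statistics {
    pub num_rows: Option<usize>,
}

type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical provider: the schema is cheap once the provider exists,
/// while statistics may require I/O and are therefore returned as a future.
pub trait AsyncTableProvider: Send + Sync {
    fn schema(&self) -> Schema;
    fn statistics(&self) -> BoxFuture<'_, Statistics>;
}

/// Hypothetical catalog: resolving a table name may itself require I/O
/// (e.g. reading remote metadata), so the lookup is a future as well.
pub trait AsyncCatalog: Send + Sync {
    fn table<'a>(&'a self, name: &'a str) -> BoxFuture<'a, Option<Arc<dyn AsyncTableProvider>>>;
}

/// Toy provider that pretends its row count came from remote metadata.
pub struct RemoteTable {
    pub schema: Schema,
    pub num_rows: usize,
}

impl AsyncTableProvider for RemoteTable {
    fn schema(&self) -> Schema {
        self.schema.clone()
    }
    fn statistics(&self) -> BoxFuture<'_, Statistics> {
        // A real implementation would await object-store reads here.
        Box::pin(async move { Statistics { num_rows: Some(self.num_rows) } })
    }
}

pub struct InMemoryCatalog {
    pub tables: HashMap<String, Arc<dyn AsyncTableProvider>>,
}

impl AsyncCatalog for InMemoryCatalog {
    fn table<'a>(&'a self, name: &'a str) -> BoxFuture<'a, Option<Arc<dyn AsyncTableProvider>>> {
        Box::pin(async move { self.tables.get(name).cloned() })
    }
}

/// Planning step: resolve the provider and its statistics asynchronously,
/// then read the schema synchronously from the resolved provider.
pub async fn plan_table(catalog: &dyn AsyncCatalog, name: &str) -> Option<(Schema, Statistics)> {
    let provider = catalog.table(name).await?;
    let stats = provider.statistics().await;
    Some((provider.schema(), stats))
}

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, for this sketch only.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

   In this shape only the lookup and the statistics are futures; once a 
provider has been resolved, its schema is available synchronously, which is 
all that building a `LogicalPlan` needs according to point 1 above.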
   


